Abstract: Proteins are linked to almost every life process. Therefore, analyzing the biological structures and properties of protein sequences is critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes it possible to model patterns in large quantities of data. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in the learned representations. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.

What problem does this paper attempt to address?

The paper attempts to address the problem of modeling and analyzing protein sequences. Specifically, the authors aim to capture biological and evolutionary information in protein sequences through large-scale pre-trained language models, thereby improving the performance of protein-related tasks.

### Main Research Background

1. **Importance of Proteins**: Proteins are involved in almost all life processes, so analyzing the biological structure and properties of proteins is crucial for exploring life, disease detection, and drug discovery.
2. **Limitations of Traditional Methods**: Traditional protein analysis methods are often time-consuming and labor-intensive, making it difficult to handle large-scale data.
3. **Application of Deep Learning**: With the development of deep learning models, researchers have begun to use these models to process large-scale biological datasets, such as using Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) for protein sequence classification.
4. **Utilization of Evolutionary Information**: Through millions of years of natural selection and evolution, protein sequences encode a large amount of evolutionary information. Inspired by the similarity between natural language and protein sequences, the authors propose using large-scale language models to model protein sequences on an evolutionary scale.

### Main Contributions of the Paper

1. **Large-Scale Pre-Trained Models**: The authors trained several large-scale models on the PFAM dataset, with the largest model having 3 billion parameters, significantly outperforming the TAPE benchmark models.
2. **Improvements in Downstream Tasks**: Significant performance improvements were achieved in four downstream tasks (secondary structure prediction, remote homology detection, stability, and contact prediction), especially in the contact prediction task, where the performance of the 3 billion parameter model was nearly twice that of the baseline model.
3. **Empirical Rules for Hyperparameter Selection**: Through extensive controlled experiments, the authors summarized some empirical rules for hyperparameter selection to balance training efficiency and resource consumption.

### Method Overview

1. **Pre-Training Task**: The masked language model (MLM) loss function was used, randomly masking 15% of the amino acid tokens, and training the model to predict the masked tokens. The next sentence prediction (NSP) loss was abandoned because there is no obvious contextual semantic relationship between protein sequences.
2. **Downstream Tasks**:
   - **Secondary Structure Prediction**: The input is the protein sequence, and the output is a label sequence representing the secondary structure position of each amino acid.
   - **Remote Homology Detection**: The input is the protein sequence, and the goal is to predict which fold family the sequence belongs to.
   - **Contact Prediction**: Predicting whether pairs of amino acids are "in contact" in the folded structure.
   - **Fluorescence Prediction**: Evaluating the model's ability to distinguish protein sequences with different mutations.
   - **Stability Prediction**: Predicting the extent to which a protein sequence can maintain its folded structure.

### Experimental Results

1. **Pre-Training Effect**: The 3 billion parameter model performed excellently in terms of masked language model loss (MLM loss) and perplexity (PPL).
2. **Downstream Task Performance**: In the four downstream tasks, the performance of the 3 billion parameter model significantly outperformed the baseline models, especially in the contact prediction task, where the performance nearly doubled.

### Conclusion

The proposed ProteinLM model successfully captures biological information and long-term dependencies in protein sequences through large-scale pre-training, significantly improving the performance of downstream tasks. Additionally, through extensive experiments, the authors summarized some empirical rules for hyperparameter selection, providing valuable references for future protein sequence modeling.
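To make the pre-training task described in the summary concrete, here is a minimal sketch of masking 15% of the amino acid tokens in a sequence and of converting a mean MLM loss into a perplexity. This is an illustration under stated assumptions, not the ProteinLM implementation: the `<mask>` symbol, the function names, the choice to mask an exact count of positions (rather than sampling each position independently), and the decision to always substitute the mask symbol are all hypothetical simplifications. The perplexity helper assumes the common convention that PPL is the exponential of the mean cross-entropy loss in nats.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the ProteinLM code


def mask_sequence(sequence, mask_rate=0.15, seed=None):
    """Mask a fraction of residues in one protein sequence.

    Returns (tokens, targets): tokens is the sequence with masked positions
    replaced by MASK; targets maps each masked index to the original residue
    that a model would be trained to predict.
    """
    tokens = list(sequence)
    if not tokens:
        return tokens, {}
    rng = random.Random(seed)
    # Assumption: mask an exact count (at least one) instead of sampling
    # each position independently with probability mask_rate.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


def perplexity(mean_loss):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


if __name__ == "__main__":
    # Toy sequence made of the 20 standard amino acid letters.
    tokens, targets = mask_sequence("ACDEFGHIKLMNPQRSTVWY", seed=0)
    print("".join(t if t != MASK else "_" for t in tokens))
    print(targets)
```

In a real training loop the loss would be computed only at the positions in `targets`, which is why the function returns them separately from the masked tokens.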

Modeling Protein Using Large-scale Pretrain Language Model
