Modeling Protein Using Large-scale Pretrain Language Model

Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang
DOI: https://doi.org/10.48550/arXiv.2108.07435
2021-12-08
Abstract: Protein is linked to almost every life process. Therefore, analyzing the biological structure and property of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data possible. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural network for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in representation. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.
Machine Learning, Computation and Language, Biomolecules
What problem does this paper attempt to address?
The paper attempts to address the problem of modeling and analyzing protein sequences. Specifically, the authors aim to capture biological and evolutionary information in protein sequences through large-scale pre-trained language models, thereby improving the performance of protein-related tasks.

### Main Research Background

1. **Importance of Proteins**: Proteins are involved in almost all life processes, so analyzing the biological structure and properties of proteins is crucial for exploring life, disease detection, and drug discovery.
2. **Limitations of Traditional Methods**: Traditional protein analysis methods are often time-consuming and labor-intensive, making it difficult to handle large-scale data.
3. **Application of Deep Learning**: With the development of deep learning models, researchers have begun to use these models to process large-scale biological datasets, such as using Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) for protein sequence classification.
4. **Utilization of Evolutionary Information**: Through millions of years of natural selection and evolution, protein sequences encode a large amount of evolutionary information. Inspired by the similarity between natural language and protein sequences, the authors propose using large-scale language models to model protein sequences on an evolutionary scale.

### Main Contributions of the Paper

1. **Large-Scale Pre-Trained Models**: The authors trained several large-scale models on the PFAM dataset, with the largest model having 3 billion parameters, significantly outperforming the TAPE benchmark models.
2. **Improvements in Downstream Tasks**: Significant performance improvements were achieved in four downstream tasks (secondary structure prediction, remote homology detection, stability, and contact prediction), especially in the contact prediction task, where the performance of the 3 billion parameter model was nearly twice that of the baseline model.
3. **Empirical Rules for Hyperparameter Selection**: Through extensive controlled experiments, the authors summarized some empirical rules for hyperparameter selection to balance training efficiency and resource consumption.

### Method Overview

1. **Pre-Training Task**: The masked language model (MLM) loss function was used, randomly masking 15% of the amino acid tokens, and training the model to predict the masked tokens. The next sentence prediction (NSP) loss was abandoned because there is no obvious contextual semantic relationship between protein sequences.
2. **Downstream Tasks**:
   - **Secondary Structure Prediction**: The input is the protein sequence, and the output is a label sequence representing the secondary structure position of each amino acid.
   - **Remote Homology Detection**: The input is the protein sequence, and the goal is to predict which fold family the sequence belongs to.
   - **Contact Prediction**: Predicting whether pairs of amino acids are "in contact" in the folded structure.
   - **Fluorescence Prediction**: Evaluating the model's ability to distinguish protein sequences with different mutations.
   - **Stability Prediction**: Predicting the extent to which a protein sequence can maintain its folded structure.

### Experimental Results

1. **Pre-Training Effect**: The 3 billion parameter model performed excellently in terms of masked language model loss (MLM loss) and perplexity (PPL).
2. **Downstream Task Performance**: In the four downstream tasks, the performance of the 3 billion parameter model significantly outperformed the baseline models, especially in the contact prediction task, where the performance nearly doubled.

### Conclusion

The proposed ProteinLM model successfully captures biological information and long-term dependencies in protein sequences through large-scale pre-training, significantly improving the performance of downstream tasks. Additionally, through extensive experiments, the authors summarized some empirical rules for hyperparameter selection, providing valuable references for future protein sequence modeling.
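
### Illustrative Sketch of the MLM Objective

To make the pre-training task from the Method Overview concrete, the following is a minimal sketch of masking about 15% of amino acid tokens and scoring predictions with the MLM loss and perplexity. It is not the authors' implementation. The function names, the `<mask>` symbol, the 20-letter alphabet, and the choice to mask each position independently with probability 0.15 (so the masked fraction is 15% only in expectation, and every selected token is always replaced by the mask symbol) are all assumptions made for illustration. Perplexity is taken as the exponential of the mean cross-entropy over masked positions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed vocabulary: the 20 standard residues
MASK_TOKEN = "<mask>"                 # hypothetical mask symbol, not from the paper


def mask_sequence(sequence, mask_prob=0.15, seed=0):
    """Mask each residue independently with probability mask_prob.

    Returns (tokens, labels): tokens has MASK_TOKEN at masked positions,
    labels holds the original residue there and None everywhere else.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    labels = [None] * len(tokens)
    for i, residue in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = residue
            tokens[i] = MASK_TOKEN
    return tokens, labels


def mlm_loss(predicted_probs, labels):
    """Mean cross-entropy over masked positions only.

    predicted_probs[i] is a dict mapping each residue to its predicted
    probability at position i; unmasked positions are ignored.
    """
    losses = [
        -math.log(predicted_probs[i][label])
        for i, label in enumerate(labels)
        if label is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


if __name__ == "__main__":
    # Made-up sequence used purely for illustration.
    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" * 10
    tokens, labels = mask_sequence(sequence)
    # Stand-in for a trained model: a uniform guess over the 20 residues.
    uniform = [{aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}] * len(tokens)
    loss = mlm_loss(uniform, labels)
    print(f"masked {sum(l is not None for l in labels)} of {len(tokens)} tokens")
    print(f"MLM loss = {loss:.4f}, perplexity = {perplexity(loss):.2f}")
```

A uniform guess over 20 residues gives a loss of ln 20 and a perplexity of about 20, which is a useful reference point: a pre-trained model that has learned anything about protein sequences should score below it.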