Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

Kamyar Zeinalipour, Neda Jamshidi, Monica Bianchini, Marco Maggini, Marco Gori
2024-08-12
Abstract: Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available. Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.
Quantitative Methods, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper explores how large language models (LLMs) can be used to generate high-quality protein sequences, and compares the adapted models with existing protein-design models. Its core objectives are:

1. **Exploring the capabilities of medium-scale language models**: The study focuses on medium-scale language models with roughly 7 to 8 billion parameters (Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B), evaluating their effectiveness in generating functionally viable protein sequences.
2. **Adapting to small datasets**: The models are trained on a small dataset of approximately 42,000 human protein sequences, showing that competitive performance can be achieved even with limited data.
3. **Conducting comparative analysis**: The adapted models are compared with existing protein design-specific models (such as ProGen, ProtGPT2, and ProLLaMA) to evaluate their performance quantitatively and qualitatively.
4. **Releasing trained models**: The authors commit to making all four trained language models freely available to the scientific community to promote transparency and collaboration in the field.

The methodology section of the paper details the training process of the models, including:

- **Retraining the tokenizer**: Using the Byte-Pair Encoding (BPE) method to adjust the tokenizer for better handling of protein sequence data.
- **Fine-tuning pre-trained models**: Fine-tuning the four selected models to enhance their ability to predict protein sequences.
- **Model evaluation**: Using AlphaFold2 to predict the structures of the generated proteins and evaluating the quality of the model outputs through metrics such as pLDDT, RMSD, TM-score, and REU.

Experimental results show that, despite the small dataset, the trained models generate high-quality protein sequences and perform comparably to existing models trained on much larger datasets.
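To make the tokenizer-retraining step more concrete, the following is a minimal sketch of how BPE merge rules could be learned from amino-acid strings. It is not the paper's tokenizer pipeline: the function names (`apply_merge`, `learn_bpe_merges`, `tokenize`), the toy sequences, and the tie-breaking rule are all assumptions made for illustration, and a real setup would more likely retrain the model's own tokenizer with an existing library.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules, starting from single residues.

    Hypothetical helper for illustration only.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken alphabetically
        # (an assumption of this sketch, chosen for determinism).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_sequences = ["MKMKMK", "MKA"]  # toy data, not real proteins
    rules = learn_bpe_merges(toy_sequences, num_merges=2)
    print(rules)
    print(tokenize("MKMKA", rules))
```

Frequent residue pairs become single tokens, so recurring motifs are represented by fewer tokens than a character-level encoding would need. Concatenating the tokens always reproduces the original sequence.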
Additionally, the paper provides detailed experimental setups and result analyses, including the selection of datasets for protein sequence generation, training configurations, and specific evaluation methods.
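Two of the structural metrics named in the summary, RMSD and TM-score, can be illustrated with a short sketch. It assumes the two structures are already residue-aligned and superposed, so it evaluates the formulas for a fixed superposition; the full TM-score additionally searches for the superposition that maximizes the score, which is omitted here. The function names and the synthetic coordinates are hypothetical. pLDDT (a confidence score produced by AlphaFold2) and REU (Rosetta energy units) come from external tools and are not reproduced.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points (same units as input)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def tm_score(coords_a, coords_b, target_length=None):
    """TM-score formula for a fixed alignment and superposition.

    TM = (1 / L) * sum_i 1 / (1 + (d_i / d0)^2),
    with d0 = 1.24 * (L - 15)^(1/3) - 1.8 and L the target length.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    length = target_length if target_length is not None else len(coords_a)
    if length < 19:
        # The d0 formula is not positive for very short chains; this sketch
        # simply refuses them rather than choosing a fallback value.
        raise ValueError("this sketch requires a target length of at least 19")
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    total = sum(
        1.0 / (1.0 + (math.dist(a, b) / d0) ** 2)
        for a, b in zip(coords_a, coords_b)
    )
    return total / length


if __name__ == "__main__":
    # Synthetic C-alpha trace: 20 points spaced 3.8 angstroms apart on a line.
    reference = [(3.8 * i, 0.0, 0.0) for i in range(20)]
    # Same trace displaced by 1 angstrom along y.
    shifted = [(x, y + 1.0, z) for x, y, z in reference]
    print(rmsd(reference, shifted))
    print(tm_score(reference, shifted))
```

RMSD is reported in the units of the coordinates and grows with any local deviation, while TM-score is normalized by the target length and lies between 0 and 1, which makes it easier to compare across proteins of different sizes.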