Abstract: Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available. Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.

What problem does this paper attempt to address?

The paper explores how large language models (LLMs) can be used to generate high-quality protein sequences, and compares the adapted models against existing protein design models. Its core objectives are:

1. **Exploring the capabilities of medium-scale language models**: The study focuses on medium-scale language models with roughly 7 to 8 billion parameters (Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B) and evaluates their effectiveness in generating functionally viable protein sequences.
2. **Adapting to small datasets**: The models are trained on a small dataset of approximately 42,000 human protein sequences, showing that efficient performance can be achieved even with limited data.
3. **Conducting comparative analysis**: The models are compared with existing protein design-specific models (such as ProGen, ProtGPT2, and ProLLaMA) to evaluate their performance quantitatively and qualitatively.
4. **Releasing trained models**: The authors commit to making all four trained language models freely available to the scientific community to promote transparency and collaboration in the field.

The methodology section of the paper details the training process of the models, including:

- **Retraining the tokenizer**: Using the Byte-Pair Encoding (BPE) method to adjust the tokenizer for better handling of protein sequence data.
- **Fine-tuning pre-trained models**: Fine-tuning the four selected models to enhance their ability to predict protein sequences.
- **Model evaluation**: Using AlphaFold2 to predict the structures of the generated proteins and evaluating the quality of the model outputs with metrics such as pLDDT, RMSD, TM-score, and REU.

Experimental results show that, despite the small dataset, the trained models generate high-quality protein sequences and perform comparably to existing models trained on large datasets.
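
The tokenizer-retraining step listed above can be illustrated with a toy version of Byte-Pair Encoding applied to amino-acid strings. This is only a minimal sketch of the general BPE idea, not the authors' implementation: the summary does not state the vocabulary size or tooling used, and the function names (`learn_bpe_merges`, `apply_merge`, `tokenize`) and the example sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE merges from protein sequences (toy version)."""
    # Start from single-residue tokens, e.g. "MKV" -> ["M", "K", "V"].
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a new sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Hypothetical toy corpus; real training data would be full protein sequences.
    merges = learn_bpe_merges(["MKVLMKVL", "MKVA"], num_merges=3)
    print(merges)
    print(tokenize("MKVLMKVA", merges))
```

Frequently co-occurring residues end up as single tokens, which shortens the input the language model has to process. In practice one would likely use an established tokenizer library rather than a hand-written loop like this one.
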
Additionally, the paper provides detailed experimental setups and result analyses, including the selection of datasets for protein sequence generation, training configurations, and specific evaluation methods.
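
Two of the structural metrics named above, RMSD and TM-score, can be sketched in a few lines. The sketch below is not the paper's evaluation pipeline. It assumes the two structures have the same length, are matched residue by residue, and are already superimposed; real tools perform an optimal superposition and alignment first, which is omitted here. The function names and the synthetic coordinates are hypothetical. The TM-score uses the standard distance scale d0(L) = 1.24 (L - 15)^(1/3) - 1.8, with a lower bound of 0.5 Å applied for short chains.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between matched 3D points (no superposition)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


def tm_score(coords_model, coords_target):
    """TM-score for matched, pre-superimposed points, normalized by target length."""
    length = len(coords_target)
    if len(coords_model) != length or length == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Length-dependent distance scale; clamp so short chains stay well defined.
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8 if length > 15 else 0.5
    d0 = max(d0, 0.5)
    total = sum(
        1.0 / (1.0 + (math.dist(p, q) / d0) ** 2)
        for p, q in zip(coords_model, coords_target)
    )
    return total / length


if __name__ == "__main__":
    # Synthetic example: a straight 30-residue trace and a copy shifted by 3 units.
    target = [(float(i), 0.0, 0.0) for i in range(30)]
    model = [(x + 3.0, y, z) for x, y, z in target]
    print(rmsd(model, target))
    print(tm_score(model, target))
```

Lower RMSD and higher TM-score (which ranges from 0 to 1) both indicate closer agreement between a predicted structure and its reference. pLDDT and REU come from the structure predictor and an energy function respectively, so they are not reproduced here.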

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses
