Abstract: Large Language Models (LLMs) have achieved remarkable performance in multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute a protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, unlike LLMs in NLP, current PLMs cannot handle the protein understanding task and the protein generation task simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations of current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. As for the protein understanding task, ProLLaMA achieves a 62% exact match rate in superfamily prediction. Code, model weights, and datasets are available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.
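To make the "exact match rate" reported for superfamily prediction concrete, here is a minimal sketch of how such a metric is commonly computed. The function name, the whitespace and case normalization, and the toy labels are assumptions for illustration; the abstract does not specify the evaluation script.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions that equal their reference label exactly.

    Assumption: labels are compared after trimming whitespace and
    lowercasing; the paper's own normalization is not stated here.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical superfamily labels, used only to exercise the function.
toy_predictions = ["SuperfamilyA", "SuperfamilyB", "SuperfamilyC"]
toy_references = ["SuperfamilyA", "SuperfamilyX", "superfamilyc "]
toy_rate = exact_match_rate(toy_predictions, toy_references)
```

Under this definition a partially correct label counts as a miss, which is why exact match is a strict measure for free-text predictions.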

What problem does this paper attempt to address?

The paper aims to address the limitations of current Protein Language Models (PLMs) in multi-task Protein Language Processing (PLP). Specifically, it targets three main challenges:

1. **Lack of natural language capability**: Current PLMs cannot express all components of a PLP task (task instructions, inputs, and expected outputs) in protein language alone, so natural language is needed to fill the gap.
2. **Insufficient instruction understanding**: Multi-task processing requires the model to execute tasks based on user instructions, but existing PLMs lack this instruction-following ability.
3. **High training resource demand**: Teaching a model natural language, protein language, and user instructions at the same time requires a large amount of training resources, which is sometimes unaffordable.

To address these issues, the authors propose a training framework that can transform any general Large Language Model (LLM) into a PLM capable of handling multiple PLP tasks. They also introduce Protein Vocabulary Pruning (PVP) to improve training efficiency, and construct a multi-task instruction dataset of 13 million samples with superfamily information to better model the relationship between protein sequences and functions.

Through these methods, the authors developed ProLLaMA, described as the first known PLM to handle multiple PLP tasks simultaneously, including unconditional protein generation, controllable protein generation, and protein understanding. Experimental results show that ProLLaMA achieves state-of-the-art results in unconditional protein sequence generation. In controllable protein generation, ProLLaMA can generate new proteins with desired functions based on user-provided text descriptions, demonstrating its potential value for protein design.
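The idea behind vocabulary pruning can be illustrated with a small sketch. This is not the authors' implementation: the retention rule (keep tokens made only of the 20 standard amino-acid letters, plus a caller-supplied set of tokens to preserve), the function names, and the toy vocabulary are all assumptions chosen for illustration. The general point is that shrinking the vocabulary also shrinks the embedding table that must be trained.

```python
# Assumption: a token counts as "protein" if it consists only of these letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def prune_vocab(vocab, keep_tokens):
    """Keep protein-like tokens and any token listed in keep_tokens.

    vocab maps token string -> original id.
    Returns (kept token list in new-id order, mapping old id -> new id).
    """
    kept = []
    for token, old_id in sorted(vocab.items(), key=lambda item: item[1]):
        piece = token.lstrip("\u2581")  # drop a SentencePiece boundary marker
        is_protein = bool(piece) and all(ch in AMINO_ACIDS for ch in piece)
        if token in keep_tokens or is_protein:
            kept.append((token, old_id))
    old_to_new = {old_id: new_id for new_id, (_, old_id) in enumerate(kept)}
    return [token for token, _ in kept], old_to_new


def prune_embeddings(embeddings, old_to_new):
    """Select the embedding rows of kept tokens, ordered by new id."""
    rows = [None] * len(old_to_new)
    for old_id, new_id in old_to_new.items():
        rows[new_id] = embeddings[old_id]
    return rows


# Hypothetical toy vocabulary and one-dimensional embeddings.
toy_vocab = {"<s>": 0, "</s>": 1, "the": 2, "\u2581MK": 3, "A": 4,
             "hello": 5, "GL": 6, "\u2581": 7}
toy_embeddings = [[float(i)] for i in range(len(toy_vocab))]
toy_tokens, toy_mapping = prune_vocab(toy_vocab, keep_tokens={"<s>", "</s>"})
toy_pruned = prune_embeddings(toy_embeddings, toy_mapping)
```

In a real setting, `keep_tokens` would also need to include whatever natural-language tokens the instruction format relies on, and the same row selection would apply to the output projection as well as the input embeddings.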

ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions

ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

PLM-interact: extending protein language models to predict protein-protein interactions

Long-context Protein Language Model

Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation

Does protein pretrained language model facilitate the prediction of protein–ligand interaction?

PLLaMa: An Open-source Large Language Model for Plant Science

S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure

Multi-Modal Large Language Model Enables Protein Function Prediction

InstructProtein: Aligning Human and Protein Language via Knowledge Instruction

MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering

DPLM-2: A Multimodal Diffusion Protein Language Model

A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding

Modeling Protein Using Large-scale Pretrain Language Model

Efficient Inference, Training, and Fine-tuning of Protein Language Models

THPLM: a sequence-based deep learning framework for protein stability changes prediction upon point variations using pretrained protein language model