ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

Liuzhenghao Lv, Zongying Lin, Hao Li, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, Yonghong Tian
2024-07-16
Abstract: Large Language Models (LLMs) have achieved remarkable performance in multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, as of now, unlike LLMs in NLP, PLMs cannot handle the protein understanding task and the protein generation task simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations in current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. As for the protein understanding task, ProLLaMA achieves a 62% exact match rate in superfamily prediction. Code, model weights, and datasets are available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.
Computational Engineering, Finance, and Science; Biomolecules
What problem does this paper attempt to address?
The paper addresses the limitations of current Protein Language Models (PLMs) in multi-task Protein Language Processing (PLP). It identifies three main challenges:

1. **Lack of natural language capability**: Protein language alone cannot express every component of a PLP task (the task instruction, the input, and the expected output), so current PLMs need natural language to fill that gap.
2. **Insufficient instruction understanding**: Multi-task processing requires the model to execute tasks according to user instructions, and existing PLMs lack this ability.
3. **High training resource demand**: Teaching a model natural language, protein language, and user instructions at the same time requires a large amount of training resources, which can be unaffordable.

To address these issues, the authors propose a training framework that can transform any general Large Language Model (LLM) into a PLM capable of handling multiple PLP tasks. They also introduce Protein Vocabulary Pruning (PVP) to improve training efficiency, and they construct a multi-task instruction dataset of 13 million samples with superfamily information to better model the relationship between protein sequences and functions.

Using these methods, the authors develop ProLLaMA, which they describe as the first known PLM to handle multiple PLP tasks simultaneously: unconditional protein generation, controllable protein generation, and protein understanding. The experiments report state-of-the-art results in unconditional protein sequence generation and a 62% exact match rate in superfamily prediction. In controllable generation, ProLLaMA can design novel proteins with desired functions from user-provided text descriptions, which suggests practical value for protein design.
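The summary names Protein Vocabulary Pruning (PVP) but does not describe its mechanics. The sketch below illustrates one plausible reading of vocabulary pruning in general, under assumptions that are not taken from the paper: tokens are kept if they consist only of the 20 standard amino-acid letters or appear in an explicit allow-list, and the embedding rows are sliced to match. The names `prune_vocab`, `AMINO_ACIDS`, and `keep_extra` are hypothetical and do not come from the ProLLaMA code.

```python
# Hypothetical sketch of vocabulary pruning; NOT the ProLLaMA implementation.
# Assumption: a token is "protein-relevant" if every character is one of the
# 20 standard amino-acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def prune_vocab(vocab, embeddings, keep_extra=()):
    """Return a smaller vocabulary and the matching embedding rows.

    vocab: dict mapping token string -> row index into `embeddings`
    embeddings: list of embedding rows, one per token id
    keep_extra: tokens to keep regardless of content (e.g. special tokens)
    """
    keep_extra = set(keep_extra)
    # Walk tokens in their original id order so relative order is preserved.
    kept = [
        tok
        for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
        if tok in keep_extra or (tok and set(tok) <= AMINO_ACIDS)
    ]
    # Renumber the surviving tokens from 0 and copy over their embedding rows.
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy data: six tokens, each with a 2-dimensional embedding.
toy_vocab = {"<s>": 0, "hello": 1, "A": 2, "MK": 3, "the": 4, "GL": 5}
toy_emb = [[float(i)] * 2 for i in range(len(toy_vocab))]

small_vocab, small_emb = prune_vocab(toy_vocab, toy_emb, keep_extra={"<s>"})
print(small_vocab)
print(len(toy_emb), "->", len(small_emb), "embedding rows")
```

Shrinking the vocabulary reduces the size of the embedding and output layers, which is one way such pruning could lower training cost. Which tokens the actual method retains, and how natural-language instruction tokens are treated, should be checked against the paper and the released code.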