Prot2Token: A multi-task framework for protein language processing using autoregressive language modeling

Mahdi Pourmirzaei, Farzaneh Esmaili, Mohammadreza Pourmirzaei, Duolin Wang, Dong Xu
DOI: https://doi.org/10.1101/2024.05.31.596915
2024-06-03
Abstract: This paper proposes a versatile tokenization method and introduces Prot2Token, a model that combines autoregressive language modeling with protein language models (PLMs) to tackle various protein prediction tasks using protein sequences. Leveraging our tokenization method, Prot2Token adapts existing PLMs for multiple tasks such as protein-level prediction, residue-level prediction, and protein-protein interaction prediction through next-token prediction of tokenized target label sequences. By incorporating prompt tokens into the decoder, Prot2Token enables multi-task training in a single end-to-end session. Our results demonstrate that Prot2Token not only matches the performance of specialized models across various tasks but also paves the way for integrating protein tasks with large language models (LLMs), representing an important step towards creating general-purpose PLMs for advanced protein language processing (PLP). Additionally, we use Prot2Token to develop S-ESM, a structure-aware version of the ESM model, which achieves competitive performance with state-of-the-art methods in 3D structure-related tasks using only protein sequences. Code is available at: https://github.com/mahdip72/prot2token
Bioinformatics
What problem does this paper attempt to address?
The paper presents Prot2Token, a multi-task framework for protein language processing. It combines autoregressive language modeling with protein language models (PLMs) to handle various prediction tasks based on protein sequences. Existing PLMs are usually designed for specific tasks and require a separate architecture and training run for each task, which is time-consuming and computationally expensive. Prot2Token achieves integrated multi-task training by converting the target labels of different tasks into predictable token sequences, reducing the need for annotated training data.

The main contributions of the paper are as follows:
1. It proposes a novel tokenization strategy for protein prediction tasks, covering protein-level prediction, residue-level prediction, and protein-protein interaction prediction.
2. It designs the Prot2Token model, which can be combined with pre-trained PLMs for end-to-end single-task or multi-task learning, and performs comparably to specialized models on multiple tasks.
3. It develops a structure-aware version of the ESM model (S-ESM) using Prot2Token, achieving competitive results in 3D structure-related tasks using only protein sequences.

Furthermore, Prot2Token enhances the structural awareness of existing PLMs by predicting 3D structural tokens, paving the way for general-purpose protein language models for advanced protein language processing. The paper also demonstrates that integrating auxiliary tasks can improve the effectiveness of Prot2Token on tasks with limited data samples.
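
To make the idea of turning task labels into token sequences more concrete, here is a minimal Python sketch. It is not the authors' implementation: the special tokens (`<bos>`, `<eos>`, `<task_...>`), the task names, and every function name are assumptions made for illustration, since the text above does not specify the actual token format. The sketch only shows how a task prompt token placed at the start of the target sequence could let one decoder vocabulary serve both protein-level and residue-level labels.

```python
from typing import Dict, Iterable, List, Sequence

# Hypothetical special tokens; the real Prot2Token vocabulary may differ.
BOS, EOS = "<bos>", "<eos>"


def encode_protein_level(task: str, label: str) -> List[str]:
    """Target sequence for a single whole-protein label."""
    return [BOS, f"<task_{task}>", label, EOS]


def encode_residue_level(task: str, positions: Sequence[int]) -> List[str]:
    """Target sequence listing annotated residue positions in ascending order."""
    return [BOS, f"<task_{task}>", *[str(p) for p in sorted(positions)], EOS]


def build_vocab(sequences: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_ids(tokens: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to the ids an autoregressive decoder would be trained to predict."""
    return [vocab[t] for t in tokens]


# Two different task types share one target vocabulary.
loc = encode_protein_level("localization", "nucleus")
sites = encode_residue_level("phosphorylation", [45, 12, 30])
vocab = build_vocab([loc, sites])
print(loc)
print(sites)
print(to_ids(sites, vocab))
```

Under these assumptions, the decoder would be conditioned on the PLM's encoding of the protein sequence and trained with next-token prediction on such target sequences, with the task token acting as the prompt that selects which kind of label to generate.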