Prot2Token: A multi-task framework for protein language processing using autoregressive language modeling

Mahdi Pourmirzaei, Farzaneh Esmaili, Mohammadreza Pourmirzaei, Duolin Wang, Dong Xu

DOI: https://doi.org/10.1101/2024.05.31.596915

2024-06-03

Abstract: This paper proposes a versatile tokenization method and introduces Prot2Token, a model that combines autoregressive language modeling with protein language models (PLMs) to tackle various protein prediction tasks using protein sequences. Leveraging our tokenization method, Prot2Token adapts existing PLMs for multiple tasks such as protein-level prediction, residue-level prediction, and protein-protein interaction prediction through next-token prediction of tokenized target label sequences. By incorporating prompt tokens into the decoder, Prot2Token enables multi-task training in a single end-to-end session. Our results demonstrate that Prot2Token not only matches the performance of specialized models across various tasks but also paves the way for integrating protein tasks with large language models (LLMs), representing an important step towards creating general-purpose PLMs for advanced protein language processing (PLP). Additionally, we use Prot2Token to develop S-ESM, a structure-aware version of the ESM model, which achieves competitive performance with state-of-the-art methods in 3D structure-related tasks using only protein sequences. Code is available at: https://github.com/mahdip72/prot2token

Bioinformatics

What problem does this paper attempt to address?

The paper presents Prot2Token, a multi-task framework for protein language processing. It combines autoregressive language modeling with protein language models (PLMs) to handle various prediction tasks based on protein sequences. Existing PLMs are usually designed for specific tasks and require a separate architecture and training run for each task, which is time-consuming and computationally expensive. Prot2Token achieves integrated multi-task training by converting the target labels of different tasks into token sequences that a decoder can predict, reducing the need for annotated training data.

The main contributions of the paper are as follows:

1. It proposes a novel tokenization strategy for protein prediction tasks, covering protein-level, residue-level, and protein-protein interaction prediction.
2. It designs the Prot2Token model, which can be combined with pre-trained PLMs for end-to-end single-task or multi-task learning, and which performs comparably to specialized models on multiple tasks.
3. It develops a structure-aware version of the ESM model (S-ESM) using Prot2Token, achieving competitive results in 3D structure-related tasks using only protein sequences.

Furthermore, Prot2Token enhances the structural awareness of existing PLMs by predicting 3D structural tokens, paving the way for universal protein language models for advanced protein language processing. The paper also demonstrates that integrating auxiliary tasks can improve the effectiveness of Prot2Token on tasks with limited data samples.
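To make the idea of "labels as token sequences" concrete, the following is a minimal Python sketch, not the authors' implementation. The token names (`<bos>`, `<eos>`, `<task_...>`), the helper functions, and the exact label formats are all assumptions chosen for illustration; the paper's actual vocabulary and tokenization rules may differ. The sketch only shows how a task prompt token followed by label tokens turns different kinds of prediction targets into one next-token prediction problem.

```python
from typing import List, Sequence, Tuple

# Hypothetical special tokens; the real Prot2Token vocabulary may differ.
BOS, EOS = "<bos>", "<eos>"


def task_prompt(task: str) -> str:
    """Build a hypothetical prompt token that tells the decoder which task to solve."""
    return f"<task_{task}>"


def encode_protein_level(task: str, label: str) -> List[str]:
    """Protein-level target: a single label token for the whole sequence."""
    return [BOS, task_prompt(task), label, EOS]


def encode_residue_level(task: str, labels: Sequence[str]) -> List[str]:
    """Residue-level target: one label token per residue, in sequence order."""
    return [BOS, task_prompt(task), *labels, EOS]


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Teacher-forcing pairs (prefix, next token).

    The first two tokens (BOS and the task prompt) are treated as given
    context, so only label tokens and EOS appear as prediction targets.
    """
    return [(tokens[:i], tokens[i]) for i in range(2, len(tokens))]


if __name__ == "__main__":
    loc = encode_protein_level("localization", "nucleus")
    ss = encode_residue_level("secondary_structure", "HHEC")
    print(loc)
    print(ss)
    for prefix, target in next_token_pairs(loc):
        print(prefix, "->", target)
```

Because every task shares the same sequence format, examples from different tasks can in principle be mixed in one training batch, with the prompt token as the only task-specific signal given to the decoder.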

FoldToken4: Consistent & Hierarchical Fold Language

SaProt: Protein Language Modeling with Structure-aware Vocabulary

ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

ProtRNA: A Protein-derived RNA Language Model by Cross-Modality Transfer Learning

Tokenizing Foldable Protein Structures with Machine-Learned Artificial Amino-Acid Vocabulary

Token-Mol 1.0: Tokenized drug design with large language model

From PSSM to Pre-Trained Language Models

DPLM-2: A Multimodal Diffusion Protein Language Model

ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

FoldToken2: Learning compact, invariant and generative protein structure language

PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction

FoldToken: Learning Protein Language via Vector Quantization and Beyond

ProTokens: A Machine-Learned Language for Compact and Informative Encoding of Protein 3D Structures

ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

Multi-Scale Protein Language Model for Unified Molecular Modeling

Efficient Inference, Training, and Fine-tuning of Protein Language Models

PROTGOAT: Improved automated protein function predictions using Protein Language Models
