Abstract: Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families. To address this, we propose $\textbf{P}$r$\textbf{o}$tein $\textbf{E}$volutionary $\textbf{T}$ransformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and can extrapolate from short context lengths to generalize well even for small families. This is enabled by a unique Transformer layer; we model tokens sequentially within sequences while attending between sequences order invariantly, allowing PoET to scale to context lengths beyond those used during training. In extensive experiments on deep mutational scanning datasets, we show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all MSA depths. We also demonstrate PoET's ability to controllably generate new protein sequences.
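
One way to read "sequences-of-sequences" is as an ordinary autoregressive factorization applied to a whole family at once. The notation below is introduced here for illustration and does not come from the abstract: $X = (x^{(1)}, \dots, x^{(n)})$ is a set of $n$ related sequences, $L_j$ is the length of $x^{(j)}$, and $v$ is a candidate variant of length $L_v$. This is a sketch of the general idea, not the paper's exact formulation.

```latex
% Joint likelihood of a family, generated one residue at a time,
% one sequence after another (assumed notation).
p(X) = \prod_{j=1}^{n} \prod_{i=1}^{L_j}
       p\left(x^{(j)}_i \,\middle|\, x^{(j)}_{<i},\, x^{(1)}, \dots, x^{(j-1)}\right)

% Scoring a variant v conditioned on retrieved family members:
\log p\left(v \mid x^{(1)}, \dots, x^{(n)}\right)
  = \sum_{i=1}^{L_v} \log p\left(v_i \,\middle|\, v_{<i},\, x^{(1)}, \dots, x^{(n)}\right)
```

Because the second expression sums over the residues of $v$ itself, it is defined for variants of any length, which is consistent with the abstract's claim that arbitrary modifications can be scored.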

What problem does this paper attempt to address?

The paper addresses limitations of existing protein language models for designing new proteins with specific functions. Specifically:

1. **Family-specific models**: These models must be trained on a multiple sequence alignment (MSA) of a specific protein family, so they cannot benefit from other protein families and perform poorly on small families.
2. **Unconditional protein language models**: These models can learn from all known natural protein sequences, but they are hard to direct toward a specific family and are less effective than family-specific models at relative fitness prediction.
3. **Hybrid models**: Models such as Tranception and TranceptEVE combine unconditional language models with family-specific models, but they still struggle to generate new insertions or deletions (indels), and the predictions of their family-specific component cannot directly benefit from cross-family transfer learning.

To overcome these limitations, the paper proposes the **Protein Evolutionary Transformer (PoET)**, an autoregressive generative model of whole protein families that generates sets of related proteins as sequences-of-sequences. The main features of PoET include:

- **Generalization ability**: By learning from tens of millions of natural protein sequence clusters, PoET can transfer what it learns across protein families, without having to be trained on an MSA of the family of interest.
- **Order independence**: PoET uses a unique Transformer layer that models token order within each sequence while being order-invariant between sequences, allowing it to scale beyond the context length used during training.
- **Controllable generation**: PoET can generate and score arbitrary modifications, including new insertions and deletions, not just substitutions.
- **Efficiency**: PoET can be used as a retrieval-augmented language model by conditioning on any protein family of interest, efficiently generating and scoring protein sequences.

Through extensive experiments, the paper demonstrates PoET's superior performance in variant effect prediction and in generating new protein sequences, especially for sequences containing many mutations and for proteins from small families.
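
The order-independence property described above can be illustrated with a small NumPy sketch. It is a toy, not PoET's layer: it assumes position indices that restart at 0 in every sequence, additive position embeddings, and a single attention head, and all names in it (`layout`, `within_sequence_mask`, `attend_to_context`) are hypothetical. The paper's actual positional encoding and masking may differ.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def layout(seqs):
    """Concatenate sequences; position indices restart at 0 in every sequence."""
    tokens = np.array([AMINO_ACIDS.index(a) for s in seqs for a in s])
    seq_ids = np.concatenate([np.full(len(s), j) for j, s in enumerate(seqs)])
    pos_ids = np.concatenate([np.arange(len(s)) for s in seqs])
    return tokens, seq_ids, pos_ids

def within_sequence_mask(seq_ids, pos_ids):
    """mask[q, k] is True when query token q may attend to key token k:
    same sequence, and k is not later than q (causal within the sequence)."""
    same_seq = seq_ids[:, None] == seq_ids[None, :]
    not_future = pos_ids[None, :] <= pos_ids[:, None]
    return same_seq & not_future

def attend_to_context(query, seqs, tok_emb, pos_emb):
    """Single-head softmax attention of one query vector over all context tokens.
    Keys carry no sequence index, only the residue and its within-sequence position."""
    tokens, _, pos_ids = layout(seqs)
    keys = tok_emb[tokens] + pos_emb[pos_ids]
    scores = keys @ query / np.sqrt(query.size)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

# Toy embeddings (random; assumptions for illustration only).
rng = np.random.default_rng(0)
dim = 8
tok_emb = rng.normal(size=(len(AMINO_ACIDS), dim))
pos_emb = rng.normal(size=(16, dim))
query = rng.normal(size=dim)

context = ["MKVA", "MKA", "MRVAL"]
out_forward = attend_to_context(query, context, tok_emb, pos_emb)
out_reversed = attend_to_context(query, context[::-1], tok_emb, pos_emb)

# Reordering the context sequences leaves the attended summary unchanged.
order_invariant = np.allclose(out_forward, out_reversed)

# Within-sequence causal mask for two short sequences, "MK" and "A".
_, seq_ids, pos_ids = layout(["MK", "A"])
mask = within_sequence_mask(seq_ids, pos_ids)
```

In this sketch, reordering the context only permutes the rows of `keys`, and a softmax-weighted sum does not depend on row order, so `order_invariant` evaluates to `True`. Nothing in the keys grows with the number of context sequences either, which is one plausible reading of why such a design could accept longer contexts at inference than it saw in training.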

PoET: A generative model of protein families as sequences-of-sequences
