PoET: A generative model of protein families as sequences-of-sequences

Timothy F. Truong Jr, Tristan Bepler
2023-11-01
Abstract: Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families. To address this, we propose Protein Evolutionary Transformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and can extrapolate from short context lengths to generalize well even for small families. This is enabled by a unique Transformer layer; we model tokens sequentially within sequences while attending between sequences order invariantly, allowing PoET to scale to context lengths beyond those used during training. In extensive experiments on deep mutational scanning datasets, we show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all MSA depths. We also demonstrate PoET's ability to controllably generate new protein sequences.
Quantitative Methods, Machine Learning
What problem does this paper attempt to address?
The paper addresses the limitations of existing protein language models for designing new proteins with specific functions. Specifically:

1. **Family-specific models**: These models must be trained on a multiple sequence alignment (MSA) of one specific protein family, so they cannot benefit from transfer learning across families and perform poorly on small families.
2. **Unconditional protein language models**: These models can learn from all known natural protein sequences, but they are difficult to direct toward a specific family and are less effective than family-specific models at relative fitness prediction.
3. **Hybrid models**: Models such as Tranception and TranceptEVE combine an unconditional language model with family-specific models, but they still struggle to generate new insertions or deletions (indels), and their family-specific components cannot directly benefit from cross-family transfer learning.

To overcome these limitations, the paper proposes the **Protein Evolutionary Transformer (PoET)**, an autoregressive generative model of whole protein families that generates sets of related proteins as sequences-of-sequences. The main features of PoET include:

- **Generalization ability**: By learning from tens of millions of natural protein sequence clusters, PoET can generalize evolutionary processes across protein families without being trained on an MSA of each family of interest.
- **Order independence**: PoET uses a unique Transformer layer that models tokens sequentially within each sequence while attending between sequences in an order-invariant way, allowing it to scale to context lengths beyond those used during training.
- **Controllable generation**: PoET can generate and score arbitrary modifications, including new insertions and deletions, not just substitutions.
- **Efficiency**: PoET can be used as a retrieval-augmented language model: conditioned on sequences from any protein family of interest, it efficiently generates and scores protein sequences.

Through extensive experiments on deep mutational scanning datasets, the paper demonstrates PoET's superior performance in variant effect prediction and its ability to generate new protein sequences, especially for sequences with many mutations and for proteins from small families.
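The idea of scoring a variant conditioned on a set of family sequences can be illustrated with a deliberately simple sketch. This is not PoET's architecture or API: the pooled amino-acid frequency model below is a toy stand-in for the conditional language model, and the names `fit_family_model`, `log_likelihood`, and `score_variant` are hypothetical. It only shows two properties mentioned above: the score is a log-likelihood ratio against the wild type given the family context, and the result does not depend on the order of the context sequences. Because no alignment is used, variants with insertions or deletions can be scored in the same way as substitutions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def fit_family_model(context, pseudocount=1.0):
    """Toy stand-in for conditioning on a family (NOT PoET's model).

    Pools residue counts over all context sequences, so the result is
    invariant to the order in which the sequences are supplied.
    """
    counts = Counter()
    for seq in context:
        counts.update(seq)  # counts each character of the sequence
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def log_likelihood(seq, model):
    """Log-probability of a sequence under the toy per-residue model.

    Only the 20 standard residues are supported in this sketch.
    """
    return sum(math.log(model[a]) for a in seq)


def score_variant(variant, wild_type, context):
    """Log-likelihood ratio of variant vs. wild type, given the family context.

    The variant may differ in length from the wild type (an indel),
    since no alignment is required.
    """
    model = fit_family_model(context)
    return log_likelihood(variant, model) - log_likelihood(wild_type, model)


if __name__ == "__main__":
    # Hypothetical toy family, for illustration only.
    family = ["MKKLA", "MKKIA", "MRKLA"]
    wild_type = "MKKLA"
    for variant in ["MKKLK", "MKKLW", "MKKLAA"]:
        print(variant, round(score_variant(variant, wild_type, family), 3))
```

In PoET itself, the conditional distribution comes from a Transformer trained across protein families rather than from pooled frequencies, and it accounts for residue order within each sequence, which this toy model ignores.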