Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space

Deniz Akpinaroglu, Kosuke Seki, Amy Guo, Eleanor Zhu, Mark J. S. Kelly, Tanja Kortemme
DOI: https://doi.org/10.1101/2023.12.15.571823
2023-12-18
Abstract: Machine learning has revolutionized computational protein design, enabling significant progress in protein backbone generation and sequence design. Here, we introduce Frame2seq, a structure-conditioned masked language model for protein sequence design. Frame2seq generates sequences in a single pass, achieves 49.1% sequence recovery on the CATH 4.2 test dataset, and accurately estimates the error in its own predictions, outperforming the autoregressive ProteinMPNN model with over six times faster inference. To probe the ability of Frame2seq to generate novel designs beyond the native-like sequence space it was trained on, we experimentally test 26 Frame2seq designs for de novo backbones with low identity to the starting sequences. We show that Frame2seq successfully designs soluble (22/26), monomeric, folded, and stable proteins (17/26), including a design with 0% sequence identity to native. The speed and accuracy of Frame2seq will accelerate exploration of novel sequence space across diverse design tasks, including challenging applications such as multi-objective optimization.