Abstract: Protein sequence design, the inverse problem of protein structure prediction, plays a crucial role in protein engineering. Although recent deep learning-based methods have shown promising advancements, achieving accurate and robust protein sequence design remains an ongoing challenge. Here, we present CarbonDesign, a new approach that draws inspiration from successful ingredients of AlphaFold for protein structure prediction and makes significant and novel developments tailored specifically for protein sequence design. At its core, CarbonDesign combines Inverseformer, a novel network architecture adapted from AlphaFold’s Evoformer that learns representations from backbone structures, with an amortized Markov Random Fields model for sequence decoding. Moreover, we incorporate other essential AlphaFold concepts into CarbonDesign: an end-to-end network recycling technique to leverage evolutionary constraints in protein language models and a multi-task learning technique to generate side chain structures corresponding to the designed sequences. Through rigorous evaluations on independent testing data sets, including the CAMEO and recent CASP15 data sets, as well as the predicted structures from AlphaFold, we show that CarbonDesign outperforms other published methods, achieving high accuracy in sequence generation. Moreover, it exhibits superior performance on de novo backbone structures obtained from recent diffusion generative models such as RFdiffusion and FrameDiff, highlighting its potential for enhancing de novo protein design. Notably, CarbonDesign also supports zero-shot prediction of the functional effects of sequence variants, indicating its potential application in directed evolution-based design. In summary, our results illustrate CarbonDesign’s accurate and robust performance in protein sequence design, making it a promising tool for applications in bioengineering.

Progressive Multi-Modality Learning for Inverse Protein Folding

Global-Context Aware Generative Protein Design

Generative De Novo Protein Design with Global Context

Towards deep learning sequence-structure co-generation for protein design

Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design

Protein sequence design on given backbones with deep learning

DNDesign: Enhancing Physical Understanding of Protein Inverse Folding Model via Denoising

Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX

Highly accurate and robust protein sequence design with CarbonDesign

An integrative approach to protein sequence design through multiobjective optimization

Generating All-Atom Protein Structure from Sequence-Only Training Data

Inverse Protein Folding Using Deep Bayesian Optimization

Robust deep learning based protein sequence design using ProteinMPNN

SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation

Accurate and robust protein sequence design with CarbonDesign

Structure-based protein design with deep learning

Protein Design by Integrating Machine Learning with Quantum Annealing and Quantum-inspired Optimization

DPLM-2: A Multimodal Diffusion Protein Language Model

Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation

Protein Language Model Supervised Precise and Efficient Protein Backbone Design Method

Protein Language Model Supervised Scalable Approach for Diverse and Designable Protein Motif-Scaffolding with GPDL