Abstract: Inverse protein folding is challenging due to its inherent one-to-many mapping characteristic, where numerous possible amino acid sequences can fold into a single, identical protein backbone. This task involves not only identifying viable sequences but also representing the sheer diversity of potential solutions. However, existing discriminative models, such as transformer-based auto-regressive models, struggle to encapsulate the diverse range of plausible solutions. In contrast, diffusion probabilistic models, as an emerging genre of generative approaches, offer the potential to generate a diverse set of sequence candidates for determined protein backbones. We propose a novel graph denoising diffusion model for inverse protein folding, where a given protein backbone guides the diffusion process on the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the nodes' physicochemical properties and local environment. Moreover, we utilize amino acid replacement matrices for the diffusion forward process, encoding the biologically meaningful prior knowledge of amino acids from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process. Our model achieves state-of-the-art performance over a set of popular baseline methods in sequence recovery and exhibits great potential in generating diverse protein sequences for a determined protein backbone structure.
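
The abstract evaluates models on sequence recovery and on the diversity of generated sequences. As a rough illustration only, the sketch below computes recovery as the fraction of positions where a generated residue matches the native one, and diversity as the mean pairwise mismatch fraction among samples. The function names and the diversity definition are assumptions made for this example, not necessarily the paper's exact metrics, and the sequences are made up.

```python
from itertools import combinations


def sequence_recovery(predicted: str, native: str) -> float:
    """Fraction of aligned positions where the predicted residue equals the native one."""
    if len(predicted) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


def mean_pairwise_diversity(samples: list[str]) -> float:
    """Mean mismatch fraction over all pairs of samples (an assumed diversity measure)."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - sequence_recovery(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    native = "MKTAYIAK"  # made-up native sequence
    samples = ["MKTAYIAK", "MKSAYLAK", "MRTAFIAK"]  # made-up generated candidates
    for s in samples:
        print(s, sequence_recovery(s, native))
    print("diversity", mean_pairwise_diversity(samples))
```

The two numbers pull in opposite directions: a model that always emits the native sequence has perfect recovery and zero diversity, which is why the abstract treats both as separate goals.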

What problem does this paper attempt to address?

This paper addresses inverse protein folding, a challenging task because the mapping from protein structure to amino acid sequence is fundamentally one-to-many: many different amino acid sequences can fold into the same protein backbone. Existing methods, such as Transformer-based autoregressive models, have difficulty capturing this diversity and non-uniqueness. The paper proposes a graph denoising diffusion model, GraDe-IF, for inverse protein folding. Given a protein backbone, the model runs the diffusion process over the amino acid residue types of the corresponding nodes, with the backbone guiding the process. It infers the joint distribution of amino acids conditioned on each node's physicochemical properties and local environment. In the forward diffusion process, the model uses amino acid substitution matrices, which encode biologically meaningful prior knowledge about how amino acids relate to their spatial and sequential neighbors as well as to themselves; this reduces the sampling space of the generative process.

The main contributions of the paper are:
1. GraDe-IF, a diffusion model built on rotation-invariant graph neural networks and tailored to inverse folding, capable of generating a wide range of sequence candidates.
2. Discrete diffusion models commonly use uniform noise. This study instead uses the Blocks Substitution Matrix (BLOSUM) as the transition kernel, encoding prior knowledge about how amino acids respond to evolutionary pressure.
3. To accelerate sampling, the study adopts a discrete version of the Denoising Diffusion Implicit Model (DDIM) and provides a comprehensive theoretical analysis.

Experimental results show that GraDe-IF achieves state-of-the-art performance on a series of benchmarks, significantly improving sequence recovery rates over existing methods, particularly in conserved regions of biological importance. In addition, the structures folded from the generated sequences are highly consistent with the native structures, supporting the model's ability to generate biologically plausible new sequences for a given protein structure.
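
To make the second contribution more concrete, here is a minimal sketch of a discrete forward diffusion process whose transition kernel is derived from a substitution-score matrix rather than from uniform noise. Everything specific in it is an assumption for illustration: the four-letter alphabet, the made-up scores in `TOY_SCORES` (these are not real BLOSUM values), the row-wise softmax with a `temperature` parameter, and the fixed per-step kernel. It is not the paper's exact construction.

```python
import math
import random

# Toy alphabet and invented log-odds-style scores (NOT real BLOSUM entries).
ALPHABET = ["A", "V", "D", "E"]
TOY_SCORES = [
    [4, 0, -2, -1],
    [0, 4, -3, -2],
    [-2, -3, 6, 2],
    [-1, -2, 2, 5],
]


def transition_matrix(scores, temperature=1.0):
    """Row-wise softmax: row i is a distribution over replacements for residue i."""
    rows = []
    for row in scores:
        weights = [math.exp(s / temperature) for s in row]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def matmul(a, b):
    """Plain square-matrix product for small lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cumulative_kernel(q, steps):
    """Q^steps: row i gives q(x_t | x_0 = i) after `steps` identical noising steps."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = matmul(result, q)
    return result


def noise_sequence(seq, q_bar, rng):
    """Sample a noised sequence, corrupting each residue independently."""
    out = []
    for aa in seq:
        probs = q_bar[ALPHABET.index(aa)]
        out.append(rng.choices(ALPHABET, weights=probs, k=1)[0])
    return "".join(out)


if __name__ == "__main__":
    q = transition_matrix(TOY_SCORES)
    rng = random.Random(0)
    for t in (1, 5, 50):
        print(t, noise_sequence("AVDEAVDE", cumulative_kernel(q, t), rng))
```

With such a kernel, a residue is more likely to be corrupted into a residue with a high substitution score than into an arbitrary one, which is the sense in which the prior narrows the sampling space compared with uniform noise. Raising `temperature` flattens each row toward the uniform case.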

Graph Denoising Diffusion for Inverse Protein Folding

Mask prior-guided denoising diffusion improves inverse protein folding

Protein structure generation via folding diffusion

Diffusion Model with Representation Alignment for Protein Inverse Folding

LaGDif: Latent Graph Diffusion Model for Efficient Protein Inverse Folding with Self-Ensemble

Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models

De novo protein design with a denoising diffusion network independent of pretrained structure prediction models

DNDesign: Enhancing Physical Understanding of Protein Inverse Folding Model via Denoising

Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds

Improving diffusion-based protein backbone generation with global-geometry-aware latent encoding

Inverse Protein Folding Using Deep Bayesian Optimization

Fast non-autoregressive inverse folding with discrete diffusion

EigenFold: Generative Protein Structure Prediction with Diffusion Models

TopoDiff: Improving Protein Backbone Generation with Topology-aware Latent Encoding

A Latent Diffusion Model for Protein Structure Generation

PiFold: Toward effective and efficient protein inverse folding

Protein Conformation Generation via Force-Guided SE(3) Diffusion Models

Protein generation with evolutionary diffusion: sequence is all you need

ExEnDiff: An Experiment-guided Diffusion model for protein conformational Ensemble generation

De novo design of protein structure and function with RFdiffusion

RiboDiffusion: Tertiary Structure-based RNA Inverse Folding with Generative Diffusion Models