Graph Denoising Diffusion for Inverse Protein Folding

Kai Yi, Bingxin Zhou, Yiqing Shen, Pietro Liò, Yu Guang Wang
2023-11-07
Abstract: Inverse protein folding is challenging due to its inherent one-to-many mapping characteristic, where numerous possible amino acid sequences can fold into a single, identical protein backbone. This task involves not only identifying viable sequences but also representing the sheer diversity of potential solutions. However, existing discriminative models, such as transformer-based auto-regressive models, struggle to encapsulate the diverse range of plausible solutions. In contrast, diffusion probabilistic models, as an emerging genre of generative approaches, offer the potential to generate a diverse set of sequence candidates for determined protein backbones. We propose a novel graph denoising diffusion model for inverse protein folding, where a given protein backbone guides the diffusion process on the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the nodes' physiochemical properties and local environment. Moreover, we utilize amino acid replacement matrices for the diffusion forward process, encoding the biologically-meaningful prior knowledge of amino acids from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process. Our model achieves state-of-the-art performance over a set of popular baseline methods in sequence recovery and exhibits great potential in generating diverse protein sequences for a determined protein backbone structure.
Quantitative Methods, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses inverse protein folding, which is challenging because the mapping from a protein structure to amino acid sequences is fundamentally one-to-many: many different amino acid sequences can fold into the same protein backbone. Existing methods, such as Transformer-based autoregressive models, struggle to capture this diversity and non-uniqueness. The paper proposes a graph denoising diffusion model (GRADE-IF) for inverse protein folding. Given a protein backbone, the model runs the diffusion process over the corresponding amino acid residue types and infers the joint distribution of amino acids conditioned on each node's physicochemical properties and local environment. In the forward diffusion process, the model uses amino acid substitution matrices to encode biologically meaningful prior knowledge about amino acids, drawn from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process. The main contributions of the paper are:
1. The GRADE-IF model, a diffusion model based on rotation-invariant graph neural networks and tailored to inverse folding, capable of generating a wide range of sequence candidates.
2. Discrete diffusion models commonly use uniform noise. This study instead uses the Blocks Substitution Matrix (BLOSUM) as the transition kernel, encoding prior knowledge about how amino acids respond to evolutionary pressure.
3. To accelerate sampling, the study adopts a discrete version of the Denoising Diffusion Implicit Model (DDIM) and provides a comprehensive theoretical analysis.
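The idea of replacing uniform noise with a substitution-matrix prior can be illustrated with a minimal sketch of a discrete forward diffusion process. Everything below is an assumption made for illustration: the 4-letter alphabet, the `toy_scores` values (these are not real BLOSUM entries), the softmax normalization, the linear `betas` schedule, and all function names. It is not the paper's exact construction.

```python
import numpy as np

def row_softmax(scores, temperature=1.0):
    """Turn a similarity score matrix into a row-stochastic matrix."""
    z = scores / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def transition_matrix(prior, beta):
    """One-step kernel: keep the residue with probability (1 - beta);
    otherwise redraw it from its prior row (which may return the same residue)."""
    k = prior.shape[0]
    return (1.0 - beta) * np.eye(k) + beta * prior

def forward_sample(x0, prior, betas, rng):
    """Sample x_t ~ q(x_t | x_0) via the cumulative product of one-step kernels."""
    q_bar = np.eye(prior.shape[0])
    for beta in betas:
        q_bar = q_bar @ transition_matrix(prior, beta)
    probs = q_bar[x0]  # one categorical distribution per residue position
    xt = np.array([rng.choice(len(p), p=p) for p in probs])
    return xt, q_bar

# Hypothetical similarity scores for a toy 4-letter alphabet (invented values).
toy_scores = np.array([
    [ 4.0,  1.0, -2.0, -3.0],
    [ 1.0,  5.0, -1.0, -2.0],
    [-2.0, -1.0,  6.0,  2.0],
    [-3.0, -2.0,  2.0,  5.0],
])
prior = row_softmax(toy_scores, temperature=2.0)
betas = np.linspace(0.02, 0.3, 20)   # assumed noise schedule
rng = np.random.default_rng(0)
x0 = np.array([0, 1, 2, 3, 0, 2])    # residue types encoded as integers
xt, q_bar = forward_sample(x0, prior, betas, rng)
```

With such a kernel, a corrupted residue is more likely to turn into a similar residue type than into a dissimilar one, which is the intuition behind using a substitution matrix rather than uniform noise.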
Experimental results demonstrate that the GRADE-IF model achieves state-of-the-art performance in a series of benchmark tests, significantly improving sequence recovery rates compared to existing methods, particularly in the recovery of conserved regions with biological importance. Additionally, the structures of the sequences generated by the model are highly consistent with the native structures, confirming its ability to generate biologically feasible new sequences for a given protein structure.
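For reference, sequence recovery is usually computed as the fraction of positions at which a designed sequence matches the native one. The sketch below assumes that per-position definition. The function name and the two example sequences are invented for illustration, and the paper's exact evaluation protocol (for example, how rates are aggregated across proteins) is not described in this summary.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

# Hypothetical sequences: 6 of 8 positions agree.
native = "MKTAYIAK"
designed = "MKSAYLAK"
rate = sequence_recovery(native, designed)
```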