LaGDif: Latent Graph Diffusion Model for Efficient Protein Inverse Folding with Self-Ensemble

Taoyu Wu, Yu Guang Wang, Yiqing Shen
2024-11-04
Abstract: Protein inverse folding aims to identify viable amino acid sequences that can fold into given protein structures, enabling the design of novel proteins with desired functions for applications in drug discovery, enzyme engineering, and biomaterial development. Diffusion probabilistic models have emerged as a promising approach in inverse folding, offering both feasible and diverse solutions compared to traditional energy-based methods and more recent protein language models. However, existing diffusion models for protein inverse folding operate in discrete data spaces, necessitating prior distributions for transition matrices and limiting smooth transitions and gradients inherent to continuous spaces, leading to suboptimal performance. Drawing inspiration from the success of diffusion models in continuous domains, we introduce the Latent Graph Diffusion Model for Protein Inverse Folding (LaGDif). LaGDif bridges discrete and continuous realms through an encoder-decoder architecture, transforming protein graph data distributions into random noise within a continuous latent space. Our model then reconstructs protein sequences by considering spatial configurations, biochemical attributes, and environmental factors of each node. Additionally, we propose a novel inverse folding self-ensemble method that stabilizes prediction results and further enhances performance by aggregating multiple denoised output protein sequences. Empirical results on the CATH dataset demonstrate that LaGDif outperforms existing state-of-the-art techniques, achieving up to 45.55% improvement in sequence recovery rate for single-chain proteins and maintaining an average RMSD of 1.96 Å between generated and native structures. The code is publicly available at <a class="link-external link-https" href="https://github.com/TaoyuW/LaGDif" rel="external noopener nofollow">this https URL</a>.
Quantitative Methods
What problem does this paper attempt to address?
The paper addresses protein inverse folding: identifying feasible amino acid sequences that can fold into a given protein structure. Solving this task supports the design of new proteins with specific functions in fields such as drug discovery, enzyme engineering, and biomaterial development. Existing diffusion models for protein inverse folding usually operate in a discrete data space and require predefined transition matrices, which limits smooth transitions and continuous gradients and results in suboptimal performance.

To overcome these limitations, the paper proposes LaGDif, a protein inverse folding method based on a latent graph diffusion model. Through an encoder-decoder architecture, the model converts the protein graph data distribution into random noise in a continuous latent space, then reconstructs the protein sequence by considering the spatial configuration, biochemical properties, and environmental factors of each node. The paper also proposes a new inverse folding self-ensemble method, which stabilizes prediction results and further improves performance by aggregating multiple denoised output protein sequences.

### Main Contributions

1. **LaGDif Model**: A latent-space diffusion model based on an Equivariant Graph Neural Network (EGNN) that generates diverse protein sequences while maintaining structural integrity. By operating in a continuous latent space, LaGDif overcomes the limitations of discrete diffusion models, allowing a smoother sample distribution and better exploration of the sequence space.
2. **ESM2 Pretrained Encoder**: The pretrained ESM2 encoder is used to encode amino acids, incorporating evolutionary information into the model and enhancing the biological relevance of the sample distribution in the latent space.
3. **Guided Sampling Method**: A guided sampling method controls the noise so that generated sequences stay diverse while retaining crucial structural integrity. Because sampling starts from a point close to the target distribution, the denoising process is more efficient and stable.
4. **Self-Ensemble Method**: Multiple denoised output protein sequences are aggregated at each sampling step, reducing individual biases and errors and improving the robustness and accuracy of predictions.

### Experimental Results

The experiments were carried out on the CATH dataset, and the results show that LaGDif significantly outperforms existing methods on the protein inverse folding task. Specifically:

- **Sequence Recovery Rate**: LaGDif increases the sequence recovery rate by 41.7%, 45.55%, and 36.52% for short-chain proteins, single-chain proteins, and all tested protein categories, respectively.
- **Structural Quality**: The generated protein structures perform well on metrics such as TM-score, average pLDDT, and average RMSD. In particular, the average RMSD is 1.96 Å, indicating that the generated structures are highly similar to the native structures.

### Case Studies

The paper also evaluates LaGDif in detail on two specific proteins (2EBO and 3OUS). The results show that LaGDif performs well on these structurally complex proteins. For 3OUS in particular, it achieves a sequence recovery rate of 87%, an RMSD of 0.3 Å, and a TM-score of 0.99.

### Conclusion

By combining latent-space diffusion, guided initial noise, and a self-ensemble scheme, LaGDif performs well in sequence recovery while maintaining high structural fidelity in the generated proteins, and is expected to accelerate the design and development of new proteins.
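To make the self-ensemble idea and the sequence recovery metric more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each candidate is a set of per-residue logits over the 20 standard amino acids and that aggregation is a plain average of per-residue probabilities; the summary does not specify the exact aggregation rule, so this is only one plausible choice. The names `self_ensemble`, `recovery_rate`, and `noisy_candidate`, as well as the toy sequence and noise level, are hypothetical and exist only for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def softmax(logits):
    """Convert a list of raw scores into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_ensemble(candidate_logits):
    """Average per-residue probabilities over K candidates, then decode.

    candidate_logits: list of K candidates; each candidate is a list of
    L positions; each position is a list of 20 logits.
    ASSUMPTION: simple probability averaging stands in for the paper's
    aggregation step, which this summary does not spell out.
    """
    num_candidates = len(candidate_logits)
    length = len(candidate_logits[0])
    sequence = []
    for pos in range(length):
        mean_probs = [0.0] * len(AMINO_ACIDS)
        for cand in candidate_logits:
            for i, p in enumerate(softmax(cand[pos])):
                mean_probs[i] += p / num_candidates
        best = max(range(len(AMINO_ACIDS)), key=mean_probs.__getitem__)
        sequence.append(AMINO_ACIDS[best])
    return "".join(sequence)


def recovery_rate(predicted, native):
    """Fraction of positions where the predicted residue equals the native one."""
    if len(predicted) != len(native):
        raise ValueError("sequences must have equal length")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


def noisy_candidate(native, noise):
    """Hypothetical stand-in for one denoised model output: toy logits that
    favour the native residue, corrupted by Gaussian noise."""
    return [
        [(2.0 if aa == res else 0.0) + random.gauss(0.0, noise) for aa in AMINO_ACIDS]
        for res in native
    ]


if __name__ == "__main__":
    random.seed(0)
    native = "MKTAYIAKQR"  # made-up toy sequence, not taken from the paper
    candidates = [noisy_candidate(native, noise=2.0) for _ in range(8)]

    # Compare decoding a single candidate with decoding the ensemble.
    single = self_ensemble(candidates[:1])
    ensemble = self_ensemble(candidates)
    print("single  :", single, recovery_rate(single, native))
    print("ensemble:", ensemble, recovery_rate(ensemble, native))
```

Averaging probabilities rather than taking a majority vote over decoded residues keeps each candidate's confidence in the result, so one confident prediction can outweigh several near-uniform ones. In a real pipeline the logits would come from decoding the denoised latent embeddings rather than from the toy `noisy_candidate` helper.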