ExEnDiff: An Experiment-guided Diffusion model for protein conformational Ensemble generation

Yikai Liu, Zongxin Yu, Richard J. Lindsay, Guang Lin, Ming Chen, Abhilash Sahoo, Sonya M Hanson
DOI: https://doi.org/10.1101/2024.10.04.616517
2024-11-07
Abstract: Understanding protein conformation is key to understanding their function. Importantly, most proteins adopt multiple conformations with non-trivial ensemble distributions that change depending on their environment to perform functions like catalysis, signaling, and transport. Recently, machine learning techniques, especially deep generative models, have been employed to develop protein conformation generators. These models, known as unified protein ensemble samplers, are trained on the PDB dataset and can generate diverse protein conformation ensembles given a protein sequence. However, their reliance solely on structural data from the PDB, which primarily captures folded protein states, restricts the diversity of the generated ensembles and can result in physically unrealistic conformations. In this paper, we overcome these challenges by introducing ExEnDiff, an experiment-guided diffusion model for protein conformation generation. ExEnDiff integrates experimental measurements as a physical prior, enabling the generation of protein conformations with desired properties. Our experiments on a variety of fast-folding and intrinsically disordered proteins demonstrate that ExEnDiff significantly advances the capabilities of current unified protein ensemble samplers. With little computational cost, ExEnDiff can capture important proteins' configuration properties and the underlying Boltzmann distribution, paving the way for a next-generation molecular dynamics engine. We further demonstrate the effectiveness of ExEnDiff to capture conformational changes in the presence of mutations and as an efficient tool for determining a reasonable CV space for protein ensembles. With these results, ExEnDiff is well-poised to push the study of protein ensembles into a data-rich regime currently available to few problems in biology.
Biophysics
What problem does this paper attempt to address?
This paper addresses the limitations of existing methods for generating protein conformational ensembles. Specifically:

1. **Lack of diversity**: Existing unified protein ensemble samplers are trained on PDB data, which mainly captures known folded structures. This limits the diversity of the generated ensembles, especially for disordered or partially disordered proteins.
2. **Physically unrealistic conformations**: Without guidance from experimental data, these models can generate physically unreasonable conformations that do not accurately reflect the true conformational distribution of a protein.
3. **Difficulty capturing the impact of mutations**: Existing deep-learning models struggle to capture how mutations affect protein conformations, especially global structural changes.

To overcome these problems, the authors introduce ExEnDiff, an experiment-guided diffusion model for protein conformational ensemble generation. By integrating experimental measurements as a physical prior, ExEnDiff generates conformational ensembles with the desired physical properties that more accurately reflect the underlying Boltzmann distribution.

### Main contributions

1. **Improved generation quality**: Guided by experimental data, ExEnDiff significantly improves the quality of the generated conformational ensembles, especially for disordered proteins.
2. **Capturing the impact of mutations**: ExEnDiff better captures the effect of mutations on conformations and produces more accurate ensembles for mutant proteins.
3. **Broader applications**: ExEnDiff works with conventional experimental data (such as NMR, SAXS, and cryo-EM) and can also be used to derive data-driven collective variables (CVs), which can further improve the efficiency of molecular dynamics simulations.

### Experimental verification

The authors verified the effectiveness of ExEnDiff through a series of experiments:

- **Benchmark tests**: On a benchmark set of ordered (fast-folding) and intrinsically disordered proteins, the ensembles generated by ExEnDiff had physical properties closer to those of the true Boltzmann distribution.
- **Mutant proteins**: Ensembles generated for Chignolin mutants demonstrated the advantage of ExEnDiff in capturing the impact of mutations.
- **Cryo-EM images**: Using synthetic 2D cryo-EM images as the guiding data, the authors verified the robustness of ExEnDiff to noisy input.

In conclusion, by integrating experimental data, ExEnDiff significantly improves the accuracy and diversity of protein conformation generation, providing a new tool for protein structure research.
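The general idea of steering a score-based sampler with an experimental measurement can be illustrated on a one-dimensional toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`prior_score`, `guidance_score`, `guided_langevin`, `posterior_mean`) are hypothetical, an analytic standard-normal score stands in for a trained network, and the per-sample Gaussian restraint on the observable f(x) = x is an assumed form of guidance. The text above does not specify how ExEnDiff applies its guidance.

```python
import math
import random
from typing import List


def prior_score(x: float) -> float:
    """Score (gradient of the log-density) of a standard normal prior.

    Assumption: this analytic score stands in for a trained score network.
    """
    return -x


def guidance_score(x: float, target: float, sigma_exp: float) -> float:
    """Gradient of a Gaussian log-likelihood for the toy observable f(x) = x.

    Assumption: `target` plays the role of a measured value and `sigma_exp`
    the role of its experimental uncertainty.
    """
    return (target - x) / sigma_exp ** 2


def guided_langevin(n_chains: int, n_steps: int, step: float,
                    target: float, sigma_exp: float,
                    seed: int = 0) -> List[float]:
    """Unadjusted Langevin sampling of prior(x) * likelihood(target | x)."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step)
    samples = []
    for _ in range(n_chains):
        x = rng.gauss(0.0, 1.0)  # start each chain from the unguided prior
        for _ in range(n_steps):
            # Total drift = prior score + experimental guidance term.
            drift = prior_score(x) + guidance_score(x, target, sigma_exp)
            x += step * drift + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def posterior_mean(target: float, sigma_exp: float) -> float:
    """Closed-form mean of an N(0, 1) prior times an N(target, sigma_exp^2) likelihood."""
    precision = 1.0 + 1.0 / sigma_exp ** 2
    return (target / sigma_exp ** 2) / precision


if __name__ == "__main__":
    samples = guided_langevin(n_chains=1000, n_steps=300, step=0.01,
                              target=2.0, sigma_exp=1.0)
    mean = sum(samples) / len(samples)
    print(f"guided sample mean:      {mean:.3f}")
    print(f"analytic posterior mean: {posterior_mean(2.0, 1.0):.3f}")
```

Because both the prior and the restraint are Gaussian in this toy, the guided samples can be compared against a closed-form posterior mean. No such closed form exists for real protein ensembles, where the observable would be a structural quantity computed from atomic coordinates.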