P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching

Yaowei Jin, Qi Huang, Ziyang Song, Mingyue Zheng, Dan Teng, Qian Shi
2024-11-26
Abstract: Biological processes, functions, and properties are intricately linked to the ensemble of protein conformations, rather than being solely determined by a single stable conformation. In this study, we have developed P2DFlow, a generative model based on SE(3) flow matching, to predict the structural ensembles of proteins. We specifically designed a valuable prior for the flow process and enhanced the model's ability to distinguish each intermediate state by incorporating an additional dimension to describe the ensemble data, which can reflect the physical laws governing the distribution of ensembles, so that the prior knowledge can effectively guide the generation process. When trained and evaluated on the MD datasets of ATLAS, P2DFlow outperforms other baseline models on extensive experiments, successfully capturing the observable dynamic fluctuations as evidenced in crystal structure and MD simulations. As a potential proxy agent for protein molecular simulation, the high-quality ensembles generated by P2DFlow could significantly aid in understanding protein functions across various scenarios. Code is available at https://github.com/BLEACH366/P2DFlow.
Biological Physics, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to predict the conformational ensembles of proteins more accurately. A protein's function depends not only on a single stable conformation but also on its multiple conformational states and the dynamic transitions between them. Experimental methods such as X-ray crystallography and nuclear magnetic resonance (NMR) are limited in the spatial and temporal scales they can resolve. Traditional molecular dynamics (MD) and Monte Carlo (MC) simulations can explore a protein's conformational space, but they are computationally inefficient and prone to getting trapped in local minima when applied to large molecular systems. Deep-learning-based methods such as AlphaFold and ESMFold perform excellently at predicting a single protein structure, yet they still struggle to generate multi-conformational ensembles, especially intermediate states that do not appear in known data. The paper therefore proposes P2DFlow, a new generative model based on SE(3) flow matching, which aims to overcome these limitations and generate conformational ensembles more efficiently and accurately, thereby supporting a deeper understanding of how proteins function in different environments.
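To make the term "SE(3) flow matching" more concrete, the following is a minimal sketch of the kind of interpolation path such models are commonly trained on: each residue is treated as a rigid frame (a rotation plus a translation), and a frame drawn from a prior is moved toward a data frame along a geodesic. This is an illustration under stated assumptions, not P2DFlow's implementation. The abstract does not specify the interpolant, the prior, or the extra ensemble dimension, so the function names (`slerp`, `interpolate_frame`, `translation_velocity_target`), the quaternion representation, and the placeholder prior frame are all hypothetical.

```python
import math

def slerp(q0, q1, t):
    """Geodesic interpolation between unit quaternions given as (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip q1 so we follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(1.0, dot)  # guard against round-off pushing dot above 1
    theta = math.acos(dot)
    if theta < 1e-8:
        return tuple(q0)  # endpoints numerically coincide
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))

def interpolate_frame(frame0, frame1, t):
    """Frame at time t on the path from frame0 (assumed prior sample) to frame1 (data).

    Each frame is (quaternion, translation). Rotations follow the SO(3)
    geodesic; translations follow a straight line in R^3.
    """
    (q0, x0), (q1, x1) = frame0, frame1
    x_t = tuple((1.0 - t) * a + t * b for a, b in zip(x0, x1))
    return slerp(q0, q1, t), x_t

def translation_velocity_target(x0, x1):
    """For the straight-line path, the translation velocity is the constant x1 - x0."""
    return tuple(b - a for a, b in zip(x0, x1))

# Hypothetical example: identity frame at the origin (stand-in for a prior
# sample) moved toward a frame rotated 90 degrees about z and shifted.
prior_frame = ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
half = math.sqrt(0.5)
data_frame = ((half, 0.0, 0.0, half), (1.0, 2.0, 3.0))

q_mid, x_mid = interpolate_frame(prior_frame, data_frame, 0.5)
```

At `t = 0.5` the interpolated frame is rotated 45 degrees about z and sits halfway along the translation segment. In a flow matching setup, a network would be trained to predict the velocity along such paths and then integrated from prior samples at generation time; how P2DFlow conditions this on its prior and additional ensemble dimension is described in the paper itself, not here.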