Learning data efficient coarse-grained molecular dynamics from forces and noise

Aleksander E. P. Durumeric, Yaoyi Chen, Frank Noé, Cecilia Clementi
2024-07-01
Abstract: Machine-learned coarse-grained (MLCG) molecular dynamics is a promising option for modeling biomolecules. However, MLCG models currently require large amounts of data from reference atomistic molecular dynamics or substantial computation for training. Denoising score matching -- the technology behind the widely popular diffusion models -- has simultaneously emerged as a machine-learning framework for creating samples from noise. Models in the first category are often trained using atomistic forces, while those in the second category extract the data distribution by reverting noise-based corruption. We unify these approaches to improve the training of MLCG force-fields, reducing data requirements by a factor of 100 while maintaining advantages typical to force-based parameterization. The methods are demonstrated on proteins Trp-Cage and NTL9 and published as open-source code.
Biological Physics
What problem does this paper attempt to address?
This paper addresses how to train machine-learned coarse-grained (MLCG) molecular dynamics models efficiently from atomistic forces and noise. Existing MLCG models require either large amounts of reference atomistic molecular dynamics data or substantial computation for training, which is a major obstacle to their wider use. The authors propose a training strategy that combines denoising score matching (DSM) with force-based parameterization, substantially reducing the amount of training data required while maintaining model accuracy.

### Main Issues:

1. **High Data Demand**: Existing MLCG models require large amounts of reference atomistic molecular dynamics data for training, which is time-consuming and computationally expensive to generate.
2. **Low Computational Efficiency**: Traditional MLCG model training requires repeated long simulations, further increasing the computational burden.
3. **Insufficient Model Accuracy**: Although existing CG models have made progress, they still fall short of atomistic models in generalization ability and accuracy.

### Solutions:

1. **Combining Denoising Score Matching and Force Matching**: The authors combine denoising score matching with traditional force matching. Introducing noise gives the model an additional training signal, which reduces data demand while maintaining accuracy.
2. **Efficient Training Method**: By optimizing neural network parameters, the authors' method can generate high-quality CG force fields from a small amount of training data.
3. **Validation and Application**: The authors validated the method on two proteins (Trp-Cage and NTL9) and demonstrated performance improvements at different noise levels.
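To make the two training signals concrete, the following is a minimal sketch on a toy problem, not the authors' estimator or code. It assumes a 1D harmonic potential U(x) = 0.5·k·x², a linear force model F(x) = a·x fitted by least squares, and Gaussian noise of width `sigma`; the function names `fit_linear_force` and `demo` are hypothetical. Force matching regresses onto reference forces at clean positions, while denoising score matching regresses onto kT times the score of the Gaussian noise kernel, which needs no forces at all.

```python
import numpy as np


def fit_linear_force(x, target):
    """Least-squares slope a for the one-parameter force model F(x) = a * x."""
    return float(np.dot(x, target) / np.dot(x, x))


def demo(n=200_000, k=2.0, kT=1.0, sigma=0.5, seed=0):
    """Compare force matching and denoising score matching on a toy system.

    Toy assumption: U(x) = 0.5 * k * x**2, so the Boltzmann distribution is
    a Gaussian with variance kT / k and the exact force is -k * x.
    """
    rng = np.random.default_rng(seed)

    # Equilibrium samples and their (noise-free) reference forces.
    x = rng.normal(0.0, np.sqrt(kT / k), size=n)
    forces = -k * x

    # Force matching: regress the model onto forces at the clean positions.
    a_fm = fit_linear_force(x, forces)

    # Denoising score matching: corrupt positions with Gaussian noise and
    # regress onto kT * (score of the noise kernel) = -kT * eps / sigma.
    eps = rng.normal(size=n)
    x_noised = x + sigma * eps
    dsm_target = -kT * eps / sigma
    a_dsm = fit_linear_force(x_noised, dsm_target)

    # Exact slope for the noise-smoothed distribution, whose variance is
    # kT / k + sigma**2.
    a_exact = -kT / (kT / k + sigma**2)
    return a_fm, a_dsm, a_exact


if __name__ == "__main__":
    a_fm, a_dsm, a_exact = demo()
    print(f"force matching slope:      {a_fm:.4f}")
    print(f"score matching slope:      {a_dsm:.4f}")
    print(f"exact noise-smoothed slope: {a_exact:.4f}")
```

In this toy setting force matching recovers the slope -k of the original potential, whereas the denoising fit converges to the force of the noise-smoothed distribution, -kT / (kT/k + sigma²). The two therefore agree only as `sigma` goes to zero, which is one reason the choice of noise level matters when the signals are combined.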
### Experimental Results:

- **Trp-Cage**: Using the traditional force matching method, 1.6M samples are needed to recover the reweighted free energy surface; with the new method, only 10% of the data is required to achieve similar accuracy.
- **NTL9**: Under low data conditions, the model combining force and noise information can significantly improve accuracy, even generating usable CG force fields with only 1% of the training data.

### Discussion:

- **Data Efficiency**: The new method is significantly more data-efficient than traditional methods, especially under low data conditions.
- **Applicability**: Although the current research focuses on two proteins, the method has broad application potential, particularly in handling complex biomolecular systems.
- **Future Directions**: Future research can further explore the application of this method in more biomolecular systems and how to handle situations with low data diversity.

In summary, this paper proposes a training strategy that significantly improves the data efficiency and computational efficiency of MLCG models by combining denoising score matching and force matching methods, providing new possibilities for the efficient simulation of complex biomolecular systems.