Abstract: Machine-learned coarse-grained (MLCG) molecular dynamics is a promising option for modeling biomolecules. However, MLCG models currently require large amounts of data from reference atomistic molecular dynamics or substantial computation for training. Denoising score matching -- the technology behind the widely popular diffusion models -- has simultaneously emerged as a machine-learning framework for creating samples from noise. Models in the first category are often trained using atomistic forces, while those in the second category extract the data distribution by reverting noise-based corruption. We unify these approaches to improve the training of MLCG force-fields, reducing data requirements by a factor of 100 while maintaining advantages typical to force-based parameterization. The methods are demonstrated on proteins Trp-Cage and NTL9 and published as open-source code.

What problem does this paper attempt to address?

This paper addresses the problem of how to learn coarse-grained (CG) molecular dynamics models efficiently from atomistic forces and noise using machine learning. Existing machine-learned coarse-grained (MLCG) models require either a large amount of reference atomistic molecular dynamics data or substantial computational resources for training, which is a major obstacle to their widespread application. The authors propose a training strategy that combines denoising score matching (DSM) with force-based parameterization, significantly reducing the amount of training data required while maintaining model accuracy.

### Main Issues:

1. **High Data Demand**: Existing MLCG models require a large amount of reference atomistic molecular dynamics data for training, which is time-consuming and computationally expensive to generate.
2. **Low Computational Efficiency**: Traditional MLCG model training requires repeated long simulations, further increasing the computational burden.
3. **Insufficient Model Accuracy**: Although existing CG models have made progress, they still fall short of atomistic models in generalization ability and accuracy.

### Solutions:

1. **Combining Denoising Score Matching and Force Matching**: The authors combine denoising score matching with traditional force matching. Adding noise to the training configurations gives the model an additional training signal, which reduces data demand while maintaining accuracy.
2. **Efficient Training Method**: By optimizing neural network parameters, the authors' method can generate high-quality CG force fields with a small amount of training data.
3. **Validation and Application**: The authors validated the method on two proteins (Trp-Cage and NTL9) and demonstrated performance improvements at different noise levels.
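The idea of combining force matching with denoising score matching can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a 1D harmonic well, a one-parameter force model f(x) = -theta * x, and a simple weighted blend of the two targets, and the names `sample_noised_dataset` and `fit_spring_constant` are hypothetical. It relies on the fact that, at a noised position, both the reference force at the clean position and kT times the score of the Gaussian noise kernel are unbiased estimates of the mean force of the noised distribution, so any blend of the two trains toward the same force field.

```python
import random


def sample_noised_dataset(n, k=1.0, kT=1.0, sigma=0.5, seed=0):
    """Draw clean samples from a 1D harmonic well and add Gaussian noise.

    Returns a list of (x_noised, force_target, noise_target) tuples.
    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Clean coordinate, Boltzmann-distributed for U(x) = 0.5 * k * x**2.
        x = rng.gauss(0.0, (kT / k) ** 0.5)
        # Noised coordinate, as in denoising score matching.
        x_noised = x + sigma * rng.gauss(0.0, 1.0)
        # Force-matching target: reference force at the clean position.
        force_target = -k * x
        # Noise target: kT times the score of the Gaussian noise kernel.
        noise_target = -kT * (x_noised - x) / sigma**2
        data.append((x_noised, force_target, noise_target))
    return data


def fit_spring_constant(data, weight):
    """Least-squares fit of f(x) = -theta * x to a blended target.

    weight=1.0 uses only forces, weight=0.0 uses only noise.
    """
    num = 0.0
    den = 0.0
    for x_noised, force_target, noise_target in data:
        target = weight * force_target + (1.0 - weight) * noise_target
        # Closed-form minimizer of sum((-theta * x - target)**2) over theta.
        num += -x_noised * target
        den += x_noised * x_noised
    return num / den


if __name__ == "__main__":
    dataset = sample_noised_dataset(50_000)
    for w in (0.0, 0.5, 1.0):
        print(f"weight={w:.1f}  fitted theta={fit_spring_constant(dataset, w):.3f}")
```

For this toy system the noised distribution is again Gaussian, so the exact answer is known: theta = kT / (kT / k + sigma^2), which is 0.8 for the default parameters. All three weightings should converge to that value as the sample size grows; they differ only in the variance of the estimate, which is where combining the two signals can help when data is scarce.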
### Experimental Results:

- **Trp-Cage**: Using the traditional force matching method, 1.6M samples are needed to recover the reweighted free energy surface; with the new method, only 10% of the data is required to achieve similar accuracy.
- **NTL9**: Under low data conditions, the model combining force and noise information can significantly improve accuracy, even generating usable CG force fields with only 1% of the training data.

### Discussion:

- **Data Efficiency**: The new method is significantly more data-efficient than traditional methods, especially under low data conditions.
- **Applicability**: Although the current research focuses on two proteins, the method has broad application potential, particularly in handling complex biomolecular systems.
- **Future Directions**: Future research can further explore the application of this method in more biomolecular systems and how to handle situations with low data diversity.

In summary, this paper proposes a training strategy that significantly improves the data efficiency and computational efficiency of MLCG models by combining denoising score matching and force matching, providing new possibilities for the efficient simulation of complex biomolecular systems.

Learning data efficient coarse-grained molecular dynamics from forces and noise

Learning Effective Molecular Models from Experimental Observables.

Machine Learning of coarse-grained Molecular Dynamics Force Fields

Two for One: Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics

Machine learned coarse-grained protein force-fields: Are we there yet?

Machine learning for molecular simulation

Ensemble learning of coarse-grained molecular dynamics force fields with a kernel approach.

Statistically Optimal Force Aggregation for Coarse-Graining Molecular Dynamics

Navigating protein landscapes with a machine-learned transferable coarse-grained model

Molecular Dynamics with Neural-Network Potentials

Molecular Dynamics with On-the-Fly Machine Learning of Quantum-Mechanical Forces

DiffMD: A Geometric Diffusion Model for Molecular Dynamics Simulations

Machine Learning in QM/MM Molecular Dynamics Simulations of Condensed-Phase Systems

Machine Learning for Molecular Dynamics on Long Timescales

Synthetic Force-Field Database for Training Machine Learning Models to Predict Mobility-Preserving Coarse-Grained Molecular-Simulation Potentials

Machine Learning for Parameter Auto-tuning in Molecular Dynamics Simulations: Efficient Dynamics of Ions near Polarizable Nanoparticles

On the role of gradients for machine learning of molecular energies and forces

Efficient Training of Neural Network Potentials for Chemical and Enzymatic Reactions by Continual Learning

Coarse-Graining with Equivariant Neural Networks: A Path Towards Accurate and Data-Efficient Models

Top-down machine learning of coarse-grained protein force-fields

Machine Learning Force Fields with Data Cost Aware Training