Random sampling versus active learning algorithms for machine learning potentials of quantum liquid water

Nore Stolte, János Daru, Harald Forbert, Dominik Marx, Jörg Behler
2024-10-15
Abstract: Training accurate machine learning potentials requires electronic structure data comprehensively covering the configurational space of the system of interest. As the construction of this data is computationally demanding, many schemes for identifying the most important structures have been proposed. Here, we compare the performance of high-dimensional neural network potentials (HDNNPs) for quantum liquid water at ambient conditions trained to data sets constructed using random sampling as well as various flavors of active learning based on query by committee. Contrary to the common understanding of active learning, we find that for a given data set size, random sampling leads to smaller test errors for structures not included in the training process. In our analysis we show that this can be related to small energy offsets caused by a bias in structures added in active learning, which can be overcome by using instead energy correlations as an error measure that is invariant to such shifts. Still, all HDNNPs yield very similar and accurate structural properties of quantum liquid water, which demonstrates the robustness of the training procedure with respect to the training set construction algorithm even when trained to as few as 200 structures. However, we find that for active learning based on preliminary potentials, a reasonable initial data set is important to avoid an unnecessary extension of the covered configuration space to less relevant regions.
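The abstract's point about energy offsets can be illustrated with a small numerical sketch. It uses synthetic numbers only, and it assumes the Pearson correlation coefficient as a stand-in for an offset-invariant error measure; the paper's exact definition of "energy correlations" is not given here. The helper names `rmse` and `pearson_r` are hypothetical.

```python
import numpy as np


def rmse(pred, ref):
    """Root-mean-square error; sensitive to a constant energy offset."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


def pearson_r(pred, ref):
    """Pearson correlation; unchanged when a constant is added to pred."""
    return float(np.corrcoef(pred, ref)[0, 1])


rng = np.random.default_rng(0)
# Synthetic "reference energies" in arbitrary units (not data from the paper).
ref = rng.normal(size=500)
# Hypothetical model predictions: reference plus small random noise.
pred = ref + 0.01 * rng.normal(size=500)
# The same predictions with a constant offset added (assumed value 0.5).
shifted = pred + 0.5

print("RMSE:       ", rmse(pred, ref), "->", rmse(shifted, ref))
print("Correlation:", pearson_r(pred, ref), "->", pearson_r(shifted, ref))
```

In this toy setup the RMSE grows by roughly the size of the offset, while the correlation is unchanged up to floating-point rounding. This is the sense in which a correlation-based measure is invariant to a constant shift.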
Chemical Physics
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper compares how two training set construction methods affect the performance of machine learning potentials (MLPs) for quantum liquid water. Specifically, the researchers examine **random sampling** and **active learning** when training high-dimensional neural network potentials (HDNNPs).

#### Main problems

1. **Effectiveness of training set construction**:
   - The researchers ask which method improves the prediction accuracy of MLPs more effectively for a given data set size.
   - For a system as complex as quantum liquid water, selecting the most appropriate structures for the training set is a key issue.
2. **Performance differences between methods**:
   - The paper compares random sampling with several active learning strategies based on query by committee and analyzes their performance at different training set sizes.
   - Although active learning is generally expected to select important structures more efficiently, the study finds that random sampling yields smaller test errors for a given data set size.
3. **Structural diversity and model generalization**:
   - The research also explores how the training set construction method influences generalization, in particular how to balance structural diversity in the training set so that the model performs well on unseen structures.
   - Active learning tends to select more diverse structures, but this may also reduce the fitting quality in the near-equilibrium region.

#### Conclusions

- Although active learning can in theory select training samples more intelligently, in practice, and especially for complex systems such as quantum liquid water, random sampling can give better prediction performance.
- This finding challenges the common understanding of active learning and suggests that the advantages and disadvantages of different training set construction methods need to be re-evaluated, especially for systems with high diversity and complexity.

### Summary

Through a detailed comparison of random sampling and active learning, this paper shows that the choice of training set construction method affects test errors when building machine learning potentials, even though all resulting HDNNPs yield very similar and accurate structural properties of quantum liquid water. The results provide a reference for similar studies and a starting point for developing more efficient training set construction algorithms.
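The two selection strategies compared above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes committee disagreement is measured as the standard deviation of the energies predicted by the committee members. The function names `select_by_committee` and `select_random` and the toy prediction values are hypothetical.

```python
import numpy as np


def select_by_committee(committee_preds, n_select):
    """Pick the candidates on which the committee disagrees most.

    committee_preds has shape (n_models, n_candidates); disagreement is
    taken here as the standard deviation across models (an assumption).
    """
    disagreement = committee_preds.std(axis=0)
    order = np.argsort(disagreement)[::-1]  # largest disagreement first
    return order[:n_select]


def select_random(n_candidates, n_select, seed=0):
    """Pick candidates uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_candidates, size=n_select, replace=False)


# Toy energy predictions from three committee members for four candidates.
preds = np.array([
    [1.00, 2.00, 3.00, 4.00],
    [1.01, 2.50, 3.00, 3.90],
    [0.99, 1.50, 3.00, 4.10],
])

picked = select_by_committee(preds, 2)
print("Committee picks:", picked.tolist())
print("Random picks:   ", select_random(preds.shape[1], 2).tolist())
```

The committee picks candidates 1 and 3, where the toy predictions spread the most, and never picks candidate 2, where all members agree. Random sampling has no such preference. This bias toward high-disagreement structures is the kind of selection bias the summary refers to.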