Accurately estimating activation energies by leveraging neural network methods and a large dataset

Guo-Jin Cao
DOI: https://doi.org/10.26434/chemrxiv-2024-4qb7s
2024-09-04
Abstract: Determining activation energies is integral to the field of computational chemistry. With the emergence of artificial intelligence, new methodologies such as neural networks have been introduced to accelerate the prediction of these energies, representing a notable advancement in this scientific domain. By incorporating topological indices, molecular fingerprints of reactants and products, and reaction enthalpy as descriptors, a deep-learning framework was developed. This framework utilizes the Reaction Graph Depth 1 (RGD1) dataset, which includes 176,992 organic reactions, to accurately estimate activation energies using artificial neural networks. The results demonstrated training R² values of 0.99, with a mean absolute error (MAE) of 2.06 kcal/mol and a root mean square error (RMSE) of 3.20 kcal/mol across an activation energy range of nearly 200 kcal/mol. These results exceed the accuracy of the other models on the same dataset as well as different datasets. Based on the learning curve, the training and validation losses were nearly identical and minimized, suggesting that the model was effectively regularized. The Chemprop model, with optimized hyperparameters, reached an R² of 0.93 on the test set, which is slightly below the performance of the previously discussed ANN method.
Chemistry
What problem does this paper attempt to address?
The paper addresses how to accurately predict the activation energies of chemical reactions using neural network methods and large-scale datasets in computational chemistry. Specifically, the author developed a deep-learning framework that takes topological indices, molecular fingerprints of reactants and products, and reaction enthalpy as descriptors, with the aim of improving the accuracy of activation energy prediction. This research is significant for understanding chemical reaction behavior, designing drugs, and innovating catalysts.

The main contributions of the paper are as follows:

1. **Use of datasets**: The Reaction Graph Depth 1 (RGD1) dataset, containing 176,992 organic reactions, was used.
2. **Model performance**: The R² value of the model reached 0.99 on the training set and 0.98 on the test set. The root-mean-square error (RMSE) was 4.15 kcal/mol, and the mean absolute error (MAE) was 2.62 kcal/mol.
3. **Generalization ability**: The model also generalized well to external test sets (such as the GPOC dataset), further verifying its applicability to different datasets.

Through these methods, the paper shows how machine-learning techniques can significantly improve the accuracy of activation energy prediction while reducing reliance on computationally expensive traditional quantum-chemical calculations.
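The error metrics quoted above (R², MAE, and RMSE) follow standard definitions, and a short sketch may help make the reported numbers concrete. The function below is illustrative only and is not the author's code: the name `regression_metrics` and the four activation energies in the example are hypothetical values chosen for easy arithmetic, not data from RGD1.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for paired reference and predicted values.

    Hypothetical helper for illustration; values are assumed to be
    activation energies in kcal/mol.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average magnitude of the residuals.
    mae = sum(abs(r) for r in residuals) / n

    # Root mean square error: penalizes large residuals more strongly than MAE.
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)

    # Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up reference and predicted activation energies (kcal/mol).
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(reference, predicted)
print(metrics)
```

Because RMSE squares each residual before averaging, it is always at least as large as MAE on the same predictions, which is consistent with the RMSE values reported above exceeding the corresponding MAE values.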