General Chemically Intuitive Atom- and Bond-level DFT Descriptors for Machine Learning Approaches to Reaction Condition Prediction

Klavs F Jensen, Miguel Nouman, Richard B. Canty, Brent A. Koscher, Matthew A. McDonald
DOI: https://doi.org/10.26434/chemrxiv-2024-wbxp6
2024-12-02
Abstract: We demonstrate the usefulness of general atom- and bond-level DFT descriptors to enhance the performance of neural networks for general reaction condition prediction. We treat reaction condition prediction as a multi-class classification task and report the performance of neural networks trained on 59,512 reactions with 283 distinct reaction condition classes and varying input embedding compositions. We show that by combining structural and general DFT descriptors in optimized ratios, models with input size up to 15% smaller than their purely structural counterparts can provide comparable recall, top-1 and top-3 accuracies. Moreover, we report improvements of up to 6%, 7% and 9% in weighted F1 score, top-1 accuracy and weighted recall, respectively, for neural networks trained on combined general DFT and structural descriptors when compared to purely structural models with equivalent architectures and input sizes. Remarkably, these results were achieved using a training set containing 267 times fewer data points than the one used for generating and embedding structural descriptors, despite both embedding strategies being similar unsupervised learning algorithms.
Chemistry
What problem does this paper attempt to address?
The paper asks how general atom- and bond-level density functional theory (DFT) descriptors can improve neural networks that predict chemical reaction conditions. Specifically, the authors examine whether combining structural descriptors with DFT descriptors improves predictive performance. In particular, they ask whether richer input information can reduce the dependence on large data sets when training data are limited, making data use and model training more efficient.

### Background and Motivation of the Paper

Automatic prediction of reaction conditions is a key step toward improving the synthesis capabilities of high-throughput experimentation (HTE) platforms, because it further reduces human intervention in automated workflows. Although specialized machine-learning models can predict conditions for certain reaction types, models that accurately predict conditions across a wide range of reactions remain rare. A major challenge is the diversity of chemical transformations, which makes high-quality data sets difficult to construct: training a high-capacity model to recognize many reaction patterns requires a large amount of data.

### Solution

To meet this challenge, the authors use quantum-chemical descriptors (specifically DFT descriptors) to enrich the input embeddings and thereby improve model performance. The approach is as follows:

1. **Data Set Generation**: The authors took reaction data from the Pistachio database, extracted the geometries of the relevant compounds from the PubChemQC database, and ran DFT calculations on a high-performance computing cluster to generate atomic- and molecular-level quantum-chemical descriptors.
2. **Descriptor Types**:
   - **Atom-level Descriptors**: The products of the energies and occupation numbers of 31 natural atomic orbitals, together with atomic mass, natural charge, and Hirshfeld charge.
   - **Bond-level Descriptors**: The products of the energies and occupation numbers of the three highest natural bond orbitals and their antibonding orbitals, together with bond length.
3. **Model Training**: The authors trained feed-forward neural networks on a multi-class classification task. The training set contains 59,512 reactions spanning 283 distinct reaction condition classes. Model inputs were purely structural descriptors, purely DFT descriptors, or a combination of the two.
4. **Performance Evaluation**: Comparing models across input embedding compositions, the authors found that models combining structural and DFT descriptors match the performance of purely structural models even with a smaller input size, and outperform them at larger input sizes. Specifically, for a model with a 4,000-dimensional input, the combined descriptors yield improvements of up to 6%, 7%, and 9% in weighted F1 score, top-1 accuracy, and weighted recall, respectively.

### Conclusion

By introducing DFT descriptors, the authors improved the performance of reaction condition prediction models, especially when training data are limited. Combining structural and DFT descriptors reduces the dependence on large data sets and also makes the models more compact and efficient, which suits real-time deployment on high-throughput experimentation platforms.
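
Two of the metrics named in the performance evaluation, top-k accuracy and weighted F1 score, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the three condition classes, and the toy scores are all hypothetical, and the definitions used are the standard ones for single-label multi-class classification, which may differ in detail from those in the paper.

```python
from collections import Counter


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, true_class in zip(scores, labels):
        # Class indices sorted by descending score; ties go to the lower index.
        ranked = sorted(range(len(row)), key=lambda c: (-row[c], c))[:k]
        hits += true_class in ranked
    return hits / len(labels)


def weighted_f1(predictions, labels):
    """Per-class F1, averaged with weights equal to each class's share of true labels."""
    support = Counter(labels)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for p, y in zip(predictions, labels) if p == cls and y == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += (n_true / len(labels)) * f1
    return total


# Hypothetical toy data: 4 reactions, 3 condition classes (the paper uses 283).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]
labels = [0, 2, 0, 2]

# Top-1 prediction is the class with the highest score in each row.
predictions = [max(range(len(row)), key=lambda c: row[c]) for row in scores]

print("top-1 accuracy:", top_k_accuracy(scores, labels, 1))
print("top-2 accuracy:", top_k_accuracy(scores, labels, 2))
print("weighted F1:", weighted_f1(predictions, labels))
```

Top-k accuracy credits a model whenever the recorded condition class appears among its k best guesses, which suits condition recommendation because a chemist can screen a short list. Weighted F1 averages per-class scores by class frequency, so common condition classes dominate the result.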