Impact of noise on inverse design: The case of NMR spectra matching

Dominik Lemm,Guido Falk von Rudorff,O. Anatole von Lilienfeld
2023-10-17
Abstract: Despite its fundamental importance and widespread use for assessing reaction success in organic chemistry, deducing chemical structures from nuclear magnetic resonance (NMR) measurements has remained largely manual and time-consuming. To keep up with the accelerated pace of automated synthesis in self-driving laboratory settings, robust computational algorithms are needed to rapidly perform structure elucidations. We analyse the effectiveness of solving the NMR spectra matching task encountered in this inverse structure elucidation problem by systematically constraining the chemical search space, and correspondingly reducing the ambiguity of the matching task. Numerical evidence collected for the twenty most common stoichiometries in the QM9-NMR database indicates systematic trends of more permissible machine learning prediction errors in constrained search spaces. Results suggest that compounds with multiple heteroatoms are harder to characterize than others. Extending QM9 by $\sim$10 times more constitutional isomers with 3D structures generated by Surge, ETKDG and CREST, we used ML models of chemical shifts trained on the QM9-NMR data to test the spectra matching algorithms. Combining both $^{13}\mathrm{C}$ and $^{1}\mathrm{H}$ shifts in the matching process suggests that machine learning prediction errors twice as large are permissible compared with matching based on $^{13}\mathrm{C}$ shifts alone. Performance curves demonstrate that reducing ambiguity and search space can decrease machine learning training data needs by orders of magnitude.
Chemical Physics
What problem does this paper attempt to address?
This paper addresses the inverse structure elucidation problem in nuclear magnetic resonance (NMR) spectra matching: determining the chemical structure of reaction products quickly and accurately in an automated laboratory environment. Specifically, the research focuses on the following aspects:

1. **Reducing the search space**: Systematically restricts the chemical search space to reduce the ambiguity of the spectra matching task.
2. **The influence of machine learning prediction errors**: Analyses how prediction errors of the machine learning model affect the success of structure elucidation for search spaces of different sizes.
3. **Combining 13C and 1H spectra**: Explores matching on 13C and 1H shifts simultaneously, and finds that this reduces ambiguity and improves error tolerance.
4. **Expanding the data set**: Extends QM9 with about 10 times more constitutional isomers, with 3D structures generated by Surge, ETKDG and CREST, to test the performance of the spectra matching algorithms.

### Main findings

- **The relationship between search space and prediction error**: In a smaller search space, a larger machine learning prediction error is acceptable. Constraining the search space with prior knowledge therefore lowers the prediction accuracy that is needed.
- **Combination of multiple types of spectra**: Matching on 13C and 1H spectra together improves the success rate of structure elucidation. Compared with using 13C or 1H spectra alone, the error tolerance increases by 85% and 261%, respectively.
- **Reduction of data requirements**: Reducing the search space and combining multiple types of spectral information can decrease the amount of training data the machine learning model needs by orders of magnitude.
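The trade-off between search space size and tolerable prediction error can be illustrated with a small numerical experiment. The following is a minimal sketch under stated assumptions, not the authors' code: the spectra are random numbers rather than QM9-NMR shifts, Gaussian noise of width `sigma` stands in for machine learning prediction error, shifts are sorted because atom assignments are taken to be unknown, and candidates are ranked by squared Euclidean distance. The function names `make_pool`, `top1_accuracy` and `noise_scan` are hypothetical.

```python
import numpy as np

def make_pool(n_candidates, n_atoms, rng):
    """Synthetic stand-in spectra: sorted random shifts in a 0-200 ppm window."""
    return np.sort(rng.uniform(0.0, 200.0, size=(n_candidates, n_atoms)), axis=1)

def top1_accuracy(pool, queries, true_idx):
    """Fraction of queries whose closest pool entry is the true candidate."""
    # Pairwise squared Euclidean distances, shape (n_queries, n_candidates).
    d2 = ((queries[:, None, :] - pool[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(np.argmin(d2, axis=1) == true_idx))

def noise_scan(pool, sizes, sigmas, n_queries=200, seed=0):
    """Top-1 accuracy for nested search spaces pool[:k] at each noise level."""
    rng = np.random.default_rng(seed)
    # Draw targets from the smallest subset so they are present in every search space.
    true_idx = rng.integers(min(sizes), size=n_queries)
    unit_noise = rng.normal(size=(n_queries, pool.shape[1]))
    results = {}
    for sigma in sigmas:
        # Perturb the true spectra, then re-sort because the assignment is unknown.
        queries = np.sort(pool[true_idx] + sigma * unit_noise, axis=1)
        for k in sizes:
            results[(k, sigma)] = top1_accuracy(pool[:k], queries, true_idx)
    return results

pool = make_pool(1000, 7, np.random.default_rng(42))
results = noise_scan(pool, sizes=[10, 100, 1000], sigmas=[0.0, 2.0, 10.0])
for (k, sigma), acc in sorted(results.items()):
    print(f"candidates={k:5d}  sigma={sigma:5.1f} ppm  top-1={acc:.2f}")
```

Because the search spaces are nested and the same noisy queries are reused, the top-1 accuracy at a fixed noise level cannot increase as the candidate pool grows, which is the qualitative trend the findings describe. The absolute numbers depend entirely on the synthetic data and carry no chemical meaning.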
### Experimental methods

- **Spectra matching distance metric**: Candidate spectra are ranked by their squared Euclidean distance to the query spectrum.
- **Chemical shift prediction**: Kernel ridge regression (KRR) models predict 13C and 1H chemical shifts, using the local atomic Faber-Christensen-Huang-Lilienfeld (FCHL19) representation.
- **Data set expansion**: QM9 is extended with about 10 times more constitutional isomers, with 3D structures generated by Surge, ETKDG and CREST.

### Conclusion

This research systematically controls the prediction accuracy of chemical shifts and the size of the search space, showing how much reducing the search space and combining multiple types of spectral information matter in inverse structure elucidation. These findings can guide the development of more efficient and accurate computer-aided structure elucidation algorithms.
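For readers unfamiliar with kernel ridge regression, the chemical shift prediction step can be sketched in a few lines of NumPy. This is an illustrative sketch only: it uses a plain Gaussian kernel on random feature vectors as a placeholder for the FCHL19 atomic representation, and the kernel choice, the hyperparameters `length_scale` and `reg`, and the function names are assumptions rather than details taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, length_scale):
    """Gaussian kernel matrix between rows of A (n, d) and rows of B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * length_scale**2))

def krr_fit(X, y, length_scale=1.0, reg=1e-8):
    """Solve (K + reg * I) alpha = y for the regression weights alpha."""
    K = gaussian_kernel(X, X, length_scale)
    return np.linalg.solve(K + reg * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, length_scale=1.0):
    """Predict shifts for new atomic environments as kernel-weighted sums."""
    return gaussian_kernel(X_new, X_train, length_scale) @ alpha

# Hypothetical data: random 5-dimensional vectors stand in for atomic
# representations, and a smooth function of them stands in for shifts in ppm.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 5))
y_train = 100.0 + 20.0 * np.sin(X_train).sum(axis=1)
X_new = rng.normal(size=(3, 5))

alpha = krr_fit(X_train, y_train)
predicted = krr_predict(X_train, alpha, X_new)
print(predicted)
```

In a setup like the one summarized above, each row of `X_train` would be the representation of one carbon or hydrogen atom and `y_train` its reference chemical shift; the predicted shifts of all atoms in a candidate molecule would then form the spectrum that enters the matching step.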