Impact of noise on inverse design: The case of NMR spectra matching

Dominik Lemm,Guido Falk von Rudorff,O. Anatole von Lilienfeld

2023-10-17

Abstract: Despite its fundamental importance and widespread use for assessing reaction success in organic chemistry, deducing chemical structures from nuclear magnetic resonance (NMR) measurements has remained largely manual and time-consuming. To keep up with the accelerated pace of automated synthesis in self-driving laboratory settings, robust computational algorithms are needed to rapidly perform structure elucidations. We analyse the effectiveness of solving the NMR spectra matching task encountered in this inverse structure elucidation problem by systematically constraining the chemical search space, and correspondingly reducing the ambiguity of the matching task. Numerical evidence collected for the twenty most common stoichiometries in the QM9-NMR database indicates systematic trends of more permissible machine learning prediction errors in constrained search spaces. Results suggest that compounds with multiple heteroatoms are harder to characterize than others. Extending QM9 by $\sim$10 times more constitutional isomers with 3D structures generated by Surge, ETKDG and CREST, we used ML models of chemical shifts trained on the QM9-NMR data to test the spectra matching algorithms. Combining both $^{13}\mathrm{C}$ and $^{1}\mathrm{H}$ shifts in the matching process suggests that machine learning prediction errors about twice as large are permissible compared with matching based on $^{13}\mathrm{C}$ shifts alone. Performance curves demonstrate that reducing ambiguity and search space can decrease machine learning training data needs by orders of magnitude.

Chemical Physics

What problem does this paper attempt to address?

This paper addresses the inverse structure elucidation problem in nuclear magnetic resonance (NMR) spectra matching, in particular how to determine the chemical structure of reaction products quickly and accurately in an automated laboratory environment. Specifically, the research focuses on the following aspects:

1. **Reducing the search space**: Systematically constraining the chemical search space reduces the ambiguity of the spectra-matching task and thereby makes structure elucidation more efficient.
2. **The influence of machine-learning prediction errors**: The paper analyses how the prediction errors of machine-learning models affect the success of structure elucidation for chemical search spaces of different sizes.
3. **Combining 13C and 1H spectra**: The paper explores matching on 13C and 1H shifts simultaneously and finds that this significantly reduces ambiguity and improves error tolerance.
4. **Expanding the data set**: QM9 is extended by about 10 times more constitutional isomers, with 3D structures generated by Surge, ETKDG and CREST, to test the spectra-matching algorithms.

### Main findings

- **The relationship between search space and prediction error**: In a smaller search space, a larger machine-learning prediction error is acceptable. Constraining the search space with prior knowledge therefore lowers the prediction accuracy that is required.
- **Combination of multiple types of spectra**: Matching on 13C and 1H spectra together significantly improves the success rate. Compared with using 13C or 1H spectra alone, the error tolerance increases by 85% and 261%, respectively.
- **Reduction of data requirements**: Reducing the search space and combining multiple types of spectral information can lower the amount of training data needed by the machine-learning model by orders of magnitude.
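
The trade-off between search-space size and permissible prediction error can be illustrated with a toy Monte Carlo experiment. The sketch below is not the authors' protocol: it assumes synthetic spectra drawn uniformly from 0 to 200 ppm, Gaussian noise as a stand-in for ML prediction error, and ranking by squared Euclidean distance between sorted shift lists. The function name `top1_rate` and all parameter values are hypothetical.

```python
import numpy as np

def top1_rate(pool_size, noise_sigma, n_atoms=8, n_trials=200, seed=0):
    """Fraction of trials in which a noisy query is matched to its true candidate.

    Toy model (assumption): every candidate is a sorted list of `n_atoms`
    shifts drawn uniformly from 0-200 ppm; candidate 0 is the true structure.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Synthetic candidate pool, one sorted spectrum per row.
        pool = np.sort(rng.uniform(0.0, 200.0, size=(pool_size, n_atoms)), axis=1)
        # Query = true spectrum plus Gaussian noise (stand-in for ML error), re-sorted.
        query = np.sort(pool[0] + rng.normal(0.0, noise_sigma, size=n_atoms))
        # Squared Euclidean distance from the query to every candidate.
        distances = np.sum((pool - query) ** 2, axis=1)
        # A hit means the true candidate is ranked first.
        hits += int(np.argmin(distances) == 0)
    return hits / n_trials

if __name__ == "__main__":
    for pool_size in (10, 100, 1000):
        rates = [top1_rate(pool_size, sigma) for sigma in (1.0, 5.0, 20.0)]
        print(pool_size, rates)
```

In this toy setting one would expect the top-1 rate to fall as either the pool size or the noise level grows, which mirrors the qualitative trend described above; the numbers themselves carry no chemical meaning.
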
### Experimental methods

- **Spectra-matching distance metric**: The squared Euclidean distance is used as the metric to rank candidate spectra.
- **Chemical shift prediction**: Kernel ridge regression (KRR) models predict 13C and 1H chemical shifts, using the local atomic Faber-Christensen-Huang-Lilienfeld (FCHL19) representation.
- **Data set expansion**: QM9 is extended with about 10 times more constitutional isomers, whose 3D structures are generated by Surge, ETKDG and CREST.

### Conclusion

This research systematically controls the prediction accuracy of chemical shifts and the size of the search space, and shows how much inverse structure elucidation benefits from a reduced search space and from combining multiple types of spectral information. These findings provide theoretical support and technical guidance for developing more efficient and accurate computer-aided structure elucidation algorithms.
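
The two computational ingredients named in the methods, kernel ridge regression and ranking by squared Euclidean distance, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: a Gaussian kernel is assumed, the feature vectors are arbitrary placeholders rather than FCHL19 descriptors, shifts are sorted before comparison to make the distance independent of atom ordering, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def krr_fit(X_train, y_train, sigma=1.0, reg=1e-8):
    """Solve (K + reg * I) alpha = y for the regression weights alpha."""
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + reg * np.eye(len(X_train)), y_train)

def krr_predict(X_train, alpha, X_query, sigma=1.0):
    """Predict one value (here: a chemical shift) per query feature vector."""
    return gaussian_kernel(X_query, X_train, sigma) @ alpha

def rank_candidates(query_shifts, candidate_shifts):
    """Rank candidates by squared Euclidean distance to the query spectrum.

    Assumption: all spectra have the same number of shifts and are sorted
    before comparison.
    """
    query = np.sort(query_shifts)
    candidates = np.sort(candidate_shifts, axis=1)
    distances = np.sum((candidates - query) ** 2, axis=1)
    return np.argsort(distances), distances

if __name__ == "__main__":
    # Placeholder one-dimensional "atomic environments" and shifts in ppm.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([10.0, 20.0, 15.0, 30.0])
    alpha = krr_fit(X, y)
    print(krr_predict(X, alpha, np.array([[1.5]])))

    query = np.array([10.0, 50.0, 120.0])
    pool = np.array([[120.0, 10.0, 50.0], [0.0, 0.0, 0.0]])
    order, distances = rank_candidates(query, pool)
    print(order, distances)
```

In a full pipeline the predicted shifts of each candidate structure would form one row of `candidate_shifts`, and the candidate ranked first would be proposed as the elucidated structure.
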
