Leveraging Pre-Trained LMs for Rapid and Accurate Structure Elucidation from 2D NMR Data

R. Wattenhofer, Susanna Di Vita, Florian Grötschla, Luca A. Lanzendörfer
Abstract: Molecular structure elucidation from NMR data is a crucial process in chemistry, particularly for applications on small and medium molecules in materials science. Despite advances in computational methods, traditional approaches remain time-consuming and data-intensive, necessitating the exploration of more efficient and automated solutions. We propose a novel application of a pretrained language model (LM) for structure elucidation using 2D NMR data, marking the first instance of such an approach with experimental data. Our method generates SMILES strings representing molecular structures by conditioning on both HSQC peaks and the molecular formula, achieving a 74% accuracy rate. This surpasses the previous state-of-the-art achieved with simulated data. By leveraging a pretrained model, our approach requires significantly less data and compute. To our knowledge, this work is the first to apply LMs to automated structure elucidation on 2D NMR spectra, particularly on experimental data.
Materials Science, Chemistry