Enhancing Molecular Structure Elucidation: MultiModalTransformer for both simulated and experimental spectra

Martin Priessner, Richard Lewis, Jon Paul Janet, Magnus Johansson, Anna Tomberg, Jonathan Goodman, Isak Lemurell
DOI: https://doi.org/10.26434/chemrxiv-2024-zmmnw
2024-11-15
Abstract: We present MultiModalTransformer (MMT), a novel deep learning architecture that directly predicts molecular structures from diverse spectroscopic data (1H-NMR, 13C-NMR, HSQC, COSY, IR, and mass spectrometry (MS)). Utilizing a modified Transformer model with attention mechanisms, the MMT simultaneously processes multiple data modalities to focus on the most relevant spectral features. Our approach demonstrates significant advancements in automated structure determination, achieving up to 94% correct identifications for real experimental samples despite being trained solely on simulated spectra. To address the challenges of vast chemical space and limited experimental data, we introduce an innovative improvement cycle that allows MMT to adapt to new chemical spaces. The model's robustness is evidenced by its ability to maintain substantial predictive power even when starting with slightly incorrect molecular structures, identifying 56% of experimental molecules correctly from modified initial guesses. MMT provides explainable predictions through token-based analysis, offering insights into its decision-making process. We also present a user-friendly GUI that integrates the full improvement cycle workflow, facilitating practical application in chemistry laboratories. By leveraging diverse spectral inputs and adaptive learning techniques, MMT represents a significant step towards fully automated structure elucidation, potentially accelerating drug discovery and natural product research while demonstrating that comprehensive chemical space coverage in training data is more critical than precise spectral accuracy.
Chemistry
What problem does this paper attempt to address?
This paper aims to automate molecular structure elucidation by developing a new deep-learning architecture, MultiModalTransformer (MMT), that directly predicts molecular structures from multiple types of spectral data: 1H-NMR, 13C-NMR, HSQC, COSY, IR, and mass spectrometry (MS). Specifically, the paper attempts to solve the following key problems:

1. **Multi-modal data processing**: Current computer-aided structure elucidation (CASE) programs are usually only able to process a single type of spectral data. MMT uses a modified Transformer model with attention mechanisms to process multiple types of spectral data simultaneously, allowing a more comprehensive analysis of molecular structure.
2. **Automated structure analysis**: Traditional CASE programs require a great deal of human intervention, especially when identifying relevant peaks in NMR spectra. MMT aims to reduce or eliminate this intervention and achieve fully automated molecular structure analysis.
3. **Adaptation to new chemical spaces**: Existing CASE methods perform poorly on unseen chemical spaces because they rely on limited experimental data and databases. The paper proposes an improvement cycle that enables MMT to adapt to new chemical spaces, improving its robustness and accuracy in practical applications.
4. **Explainable predictions**: To improve the transparency and credibility of the model, MMT provides token-based analysis that exposes its decision-making process and helps researchers understand why the model makes a specific prediction.
5. **User-friendly interface**: The paper also introduces a graphical user interface (GUI) that integrates the complete improvement cycle workflow for practical use in chemistry laboratories.
In summary, by developing the MMT model, this paper aims to overcome the limitations of existing CASE methods in multi-modal data processing, degree of automation, adaptation to new chemical spaces, interpretability, and user-friendliness, thereby advancing molecular structure elucidation.
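
To make the idea of attending over several spectral modalities at once more concrete, here is a minimal sketch of scaled dot-product attention over a single sequence of tokens drawn from different spectra. It is not the authors' architecture: the function names (`softmax`, `attend`), the two-dimensional embeddings, and the toy token values are all hypothetical and chosen only for illustration. The sketch shows that once tokens from every modality share one sequence, the attention weights can be summed per modality to see which spectrum a query draws on most.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Scaled dot-product attention of one query over (modality, key, value) tokens.

    Returns the attended output vector and the total attention weight
    assigned to each modality.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for _, key, _ in tokens
    ]
    weights = softmax(scores)

    dim_v = len(tokens[0][2])
    output = [
        sum(w * value[i] for w, (_, _, value) in zip(weights, tokens))
        for i in range(dim_v)
    ]

    per_modality = {}
    for w, (modality, _, _) in zip(weights, tokens):
        per_modality[modality] = per_modality.get(modality, 0.0) + w
    return output, per_modality


# Hypothetical toy embeddings: (modality label, key vector, value vector).
tokens = [
    ("1H-NMR", [1.0, 0.0], [1.0, 0.0]),
    ("1H-NMR", [0.9, 0.1], [0.8, 0.2]),
    ("13C-NMR", [0.0, 1.0], [0.0, 1.0]),
    ("IR", [0.1, 0.2], [0.5, 0.5]),
]
query = [2.0, 0.0]  # hypothetical query that aligns with the 1H-NMR keys

output, per_modality = attend(query, tokens)
for modality, weight in per_modality.items():
    print(f"{modality}: {weight:.3f}")
```

In this toy setup the query aligns with the 1H-NMR keys, so most of the attention weight falls on that modality. A real model would learn the embeddings and use many attention heads and layers.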