Spectro: A multi-modal approach for molecule elucidation using IR and NMR data

Rodrigo Alejandro Vargas Hernandez, Edwin Chacko, Rudra Sondhi, Kylie L. Luska, Arnav Praveen
DOI: https://doi.org/10.26434/chemrxiv-2024-37v2j
2024-11-06
Abstract: Molecular structure elucidation is a crucial but fundamentally challenging step in the characterization of materials given the large number of possible structures. Here, we introduce Spectro, an innovative multi-modal approach for molecular elucidation that combines $^{13}$C and $^{1}$H NMR data with IR. Spectro translates the embedded representations of the spectra into molecular structures using the SELFIES notation. We employed a vision model for the embedded representation of the IR data, which was pretrained to detect relevant functional group peaks in the IR spectra, achieving an F1 score of 91\%. For NMR data, we utilized LLM2Vec, treating the NMR spectra as text. This integration of multiple spectroscopic techniques allows Spectro to achieve an overall test accuracy of 93\% when trained jointly with the vision model for the IR spectra, and 82\% when trained with fixed embeddings. Our approach demonstrates the potential of multi-modal learning in tackling complex molecular characterization tasks.
Chemistry
What problem does this paper attempt to address?
The paper addresses the challenge of molecular structure elucidation. Specifically, it proposes a multimodal approach named Spectro, which integrates infrared (IR) spectroscopy and nuclear magnetic resonance (NMR) spectroscopy data to infer molecular structures. Traditionally, molecular structure elucidation is a complex and time-consuming task that requires combining multiple spectroscopic techniques and relies on the experience and knowledge of chemists. Spectro aims to automate this process, improving the accuracy and efficiency of elucidation.

### Main Issues:

1. **Complexity of Molecular Structure Elucidation**: As the number of atoms in a molecule increases, the number of possible molecular structures grows exponentially, making elucidation extremely complex.
2. **Integration of Multiple Spectroscopic Data Types**: How to effectively integrate different types of spectroscopic data (such as IR and NMR) to improve the accuracy of molecular structure elucidation.
3. **Automated Elucidation**: How to use machine learning and deep learning techniques to automate molecular structure elucidation, reducing reliance on human expertise.

### Solutions:

- **Multimodal Approach**: Spectro combines 13C NMR, 1H NMR, and IR data to improve the accuracy of molecular structure elucidation through multimodal learning.
- **Embedding Representations**: A vision model (such as ResNet50) converts the IR data into embedding vectors; a text encoder (such as LLM2Vec) does the same for the NMR data, which is treated as text.
- **Molecular Decoder**: An RNN-based decoder translates these embedding vectors into molecular structures expressed in the SELFIES notation.
- **Pre-training and Joint Training**: The IR vision model can either be pre-trained and kept fixed, or trained jointly with the rest of the model, to improve prediction accuracy.
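
To make the "NMR spectra as text" idea above concrete, here is a minimal sketch of how a peak list might be serialized into a string before it is passed to a text encoder. The function name `nmr_to_text`, the string layout, and the peak values are all assumptions made for illustration; the source does not state the format Spectro actually uses.

```python
from typing import List, Tuple


def nmr_to_text(nucleus: str, peaks: List[Tuple[float, int]]) -> str:
    """Serialize an NMR peak list as plain text (hypothetical format).

    `peaks` holds (chemical shift in ppm, count) pairs. The count field is
    an assumption standing in for whatever per-peak annotation is available.
    """
    # Sort by descending shift so the same spectrum always yields the same text.
    ordered = sorted(peaks, key=lambda p: p[0], reverse=True)
    body = "; ".join(f"{shift:.2f} ppm ({count})" for shift, count in ordered)
    return f"{nucleus} NMR: {body}"


# Made-up peak lists, used only to show the output shape.
h_text = nmr_to_text("1H", [(1.25, 3), (3.70, 2), (2.10, 1)])
c_text = nmr_to_text("13C", [(18.40, 1), (58.30, 1)])

print(h_text)  # 1H NMR: 3.70 ppm (2); 2.10 ppm (1); 1.25 ppm (3)
print(c_text)  # 13C NMR: 58.30 ppm (1); 18.40 ppm (1)
```

A string like this could then be embedded by any text encoder. Sorting the peaks is one simple way to keep the text representation deterministic for a given spectrum.
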
### Experimental Results:

- **Test Set Accuracy**: Spectro achieved an overall test accuracy of 93% when trained jointly with the IR vision model, compared with 82% when trained with fixed embeddings.
- **Functional Group Detection**: For IR data, the j-IR-vis model achieved an F1 score of 91% in detecting functional groups.
- **Molecular Structure Prediction**: Spectro exactly predicted 88% of molecular structures (Tanimoto similarity = 1) and made no erroneous token predictions in 91% of the test molecules.

### Future Prospects:

- **Expansion to Other Spectroscopic Techniques**: The authors plan to extend Spectro to other spectroscopic techniques, such as 2D NMR, to further enhance its elucidation capabilities.
- **Multitask Learning**: Multitask learning methods could be incorporated to handle missing data, improving the robustness and applicability of the model.

Overall, Spectro uses multimodal learning to automate molecular structure elucidation with high accuracy, supporting research in chemistry and materials science.
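
The two metrics quoted in the experimental results, F1 score and Tanimoto similarity, can be illustrated with a short sketch. It assumes functional groups are compared as label sets for a single spectrum and that molecules are compared as sets of fingerprint on-bit indices. The source does not say which fingerprint or which averaging scheme was used, so the helper names and example inputs here are hypothetical.

```python
from typing import Set


def f1_score(true_labels: Set[str], predicted: Set[str]) -> float:
    """F1 for one sample: harmonic mean of precision and recall."""
    true_positives = len(true_labels & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true_labels)
    return 2 * precision * recall / (precision + recall)


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by all on-bits."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return len(bits_a & bits_b) / len(union)


# Made-up functional-group labels for a single IR spectrum.
f1 = f1_score({"C=O", "O-H", "C-H"}, {"C=O", "C-H", "N-H"})

# Made-up fingerprint bit sets: an exact match and a partial match.
exact = tanimoto({1, 4, 7, 9}, {1, 4, 7, 9})
partial = tanimoto({1, 4, 7}, {1, 4, 9})

print(round(f1, 3), exact, partial)  # 0.667 1.0 0.5
```

A Tanimoto similarity of 1 means the two fingerprints are identical, which is the criterion the summary uses for an exactly predicted structure.
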