Elucidating structures from spectra using multimodal embeddings and discrete optimization

Adrian Mirza, Kevin Maik Jablonka
DOI: https://doi.org/10.26434/chemrxiv-2024-f3b18
2024-11-22
Abstract: Structure elucidation (determining molecular structures from spectroscopic data) remains one of chemistry's most fundamental and challenging tasks, essential for advancing fields from drug discovery to materials science. While machine learning approaches have attempted to automate this process, they typically focus on single spectroscopic techniques and lack crucial confidence metrics, limiting their practical utility. Here, we present spec2struct, a framework that synergistically combines multimodal embeddings, contrastive learning, and evolutionary algorithms to mimic how expert chemists approach structure determination. By aligning encoders for diverse spectroscopic techniques with molecular representations, our system can simultaneously interpret multiple types of spectroscopic evidence. This alignment guides genetic algorithms to evolve chemically valid candidates that best match the experimental data. spec2struct not only outperforms existing methods but also provides calibrated and contextualized confidence estimates. We demonstrate its real-world impact by identifying several published structures incorrectly assigned in the literature. The combination of performance, reliability, and versatility positions spec2struct as a powerful tool for accelerating chemical discovery.
Chemistry
What problem does this paper attempt to address?
The paper addresses structure elucidation: inferring molecular structures from spectral data, a fundamental and challenging task in chemical research. Machine-learning methods have attempted to automate this process, but existing methods usually handle only a single spectral technique and lack confidence measures, which limits their practical value. The paper proposes a framework named **spec2struct** that combines multimodal embeddings, contrastive learning, and evolutionary algorithms to mimic how expert chemists approach structure determination. Specifically, the system is designed to:

1. **Interpret multiple kinds of spectral evidence simultaneously**: By aligning encoders for different spectral techniques with molecular representations, the system can process several types of spectral data at once.
2. **Generate chemically valid candidate structures**: Genetic algorithms, guided by this alignment, evolve chemically valid candidates that best fit the experimental data.
3. **Provide calibrated and contextualized confidence estimates**: Alongside its predictions, the system reports confidence measures, so that chemists can judge how far to trust each prediction.

**Main contributions**:

- **Performance improvement**: Combining multiple spectral techniques significantly improves the system's retrieval performance, which reaches an accuracy of 98.5%.
- **Error detection**: The system can identify mis-assigned structures in the literature, thereby reducing errors in chemical research.
- **Flexibility**: The system handles known compounds and can also generate new compound structures, which suits the needs of synthetic chemists.

Together, these properties position **spec2struct** as a powerful tool for accelerating chemical discovery.
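The two ideas summarized above, scoring candidates against several spectra in a shared embedding space and evolving candidates toward a better match, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`multimodal_scores`, `evolve`), the use of mean cosine similarity to combine modalities, random vectors in place of trained encoders, and bit strings as stand-ins for molecules.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length (guarding against zero norm)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, 1e-12)


def multimodal_scores(spectrum_embeddings, candidate_embeddings):
    """Mean cosine similarity between each candidate and every spectrum embedding.

    spectrum_embeddings: dict mapping modality name -> vector of shape (d,)
    candidate_embeddings: array of shape (n, d)
    Averaging over modalities is an assumed combination rule.
    """
    cands = l2_normalize(candidate_embeddings)
    per_modality = [cands @ l2_normalize(v) for v in spectrum_embeddings.values()]
    return np.mean(per_modality, axis=0)


def evolve(fitness, n_bits, pop_size=40, generations=60, mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over bit strings (hypothetical stand-ins for molecules)."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Elitism: keep the better-scoring half unchanged.
        elite = pop[np.argsort(scores)[::-1][: pop_size // 2]]
        n_children = pop_size - len(elite)
        parents_a = elite[rng.integers(0, len(elite), n_children)]
        parents_b = elite[rng.integers(0, len(elite), n_children)]
        # One-point crossover: bits before the cut come from parent A.
        cut = rng.integers(1, n_bits, size=n_children)
        mask = np.arange(n_bits) < cut[:, None]
        children = np.where(mask, parents_a, parents_b)
        # Random bit-flip mutation.
        flips = rng.random(children.shape) < mutation_rate
        children = np.where(flips, 1 - children, children)
        pop = np.vstack([elite, children])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)], float(scores.max())


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    d, n_candidates, n_bits = 32, 50, 24

    # Retrieval: rank a library of candidates against two noisy "spectra".
    library = rng.normal(size=(n_candidates, d))
    true_index = 3
    spectra = {
        "nmr": library[true_index] + 0.1 * rng.normal(size=d),
        "ir": library[true_index] + 0.1 * rng.normal(size=d),
    }
    ranking = np.argsort(multimodal_scores(spectra, library))[::-1]
    print("top-ranked candidate:", ranking[0])

    # Generation: evolve a bit string whose mock embedding matches a target.
    W = rng.normal(size=(d, n_bits))  # mock "molecule encoder"
    target = {"nmr": W @ rng.integers(0, 2, size=n_bits)}
    best, score = evolve(
        lambda bits: multimodal_scores(target, (W @ bits)[None, :])[0], n_bits
    )
    print("best fitness:", round(score, 3))
```

In a real pipeline the random vectors would be replaced by the outputs of contrastively trained spectrum and molecule encoders, and the bit-string mutations by chemically valid molecular edits. The sketch only shows how a shared embedding space lets one similarity function serve both retrieval and fitness evaluation.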