Reusability report: Annotating metabolite mass spectra with domain-inspired chemical formula transformers

Janne Heirman, Wout Bittremieux
DOI: https://doi.org/10.26434/chemrxiv-2024-97r5j
2024-03-13
Abstract: We present an in-depth exploration of the Metabolite Inference with Spectrum Transformers (MIST) tool for annotating small molecule mass spectrometry (MS) data, focusing on its reproducibility and generalizability. MIST innovates by integrating a “chemical formula transformer” to process MS/MS spectra, aiming to bridge the substantial knowledge gap in untargeted MS studies, where only a fraction of spectra are confidently annotated. Here, we critically assess MIST’s reproducibility by following the tool’s original training and testing protocols, encountering minor challenges but largely succeeding in replicating results. We also evaluate MIST’s generalizability by applying it to an external dataset from the CASMI 2022 challenge, revealing insights into the model’s performance on previously unseen data. An ablation study further investigates the impact of various model features on database retrieval performance, suggesting that some algorithmic complexities may not significantly enhance performance. Through rigorous evaluation, this work underscores the challenges and considerations in developing robust computational tools for MS data analysis. We advocate for community-wide efforts in benchmarking, transparency, and data sharing to foster advancements in metabolomics and computational biology.
Chemistry
What problem does this paper attempt to address?
The paper addresses a persistent limitation of untargeted mass spectrometry (MS) studies: only a small fraction of the acquired spectra can be reliably annotated, which leaves a substantial knowledge gap. Although an untargeted MS study can acquire thousands to millions of MS/MS spectra, existing state-of-the-art methods reliably annotate only 5% to 10% of them on average. This limits the ability to determine which molecular structures are present and, in turn, undermines the potential impact of many biological studies. To tackle this challenge, Goldman et al. recently introduced Metabolite Inference with Spectrum Transformers (MIST), a tool for annotating MS/MS spectra. MIST processes MS/MS spectra with a "chemical formula transformer" that integrates domain-specific knowledge into the deep neural network architecture, with the aim of narrowing this annotation gap. This paper evaluates the reproducibility and generalizability of MIST in depth.