Estimate Metabolite Taxonomy and Structure with a Fragment-Centered Database and Fragment Network

Hansen Zhao, Xu Zhao, Huan Yao, Jiaxin Feng, Sichun Zhang, Xinrong Zhang
DOI: https://doi.org/10.48550/arXiv.2101.03784
2021-01-11
Abstract: Metabolite structure identification has become the major bottleneck of mass spectrometry-based metabolomics research. To date, a number of mass spectral databases and search algorithms have been developed to address this issue. However, two critical problems remain: the low coverage of chemical component records in databases and significant MS/MS spectral variations related to experimental equipment and parameter settings. In this work, we treat molecule fragments as the basic building blocks of metabolic components, which have relatively consistent signatures in MS/MS spectra. From a bottom-up point of view, we built a fragment-centered database, MSFragDB, by reorganizing the data from the Human Metabolome Database (HMDB), and developed an intensity-free search algorithm to search and rank the most relevant metabolites according to the user's input. We also propose the concept of a fragment network, a graph structure that encodes the relationships between molecule fragments to find close motifs that indicate a specific chemical structure. Although MSFragDB is based on the same dataset as the HMDB, validation results imply that it has a higher hit ratio and, furthermore, can estimate the possible taxonomy to which a query spectrum belongs when the corresponding chemical component is missing from the database. Aided by the Fragment Network, MSFragDB was also shown to estimate the correct structure when the MS/MS spectrum suffers from precursor contamination. The proposed strategy is general and can be adopted in existing databases. We believe MSFragDB and the Fragment Network can improve the performance of structure identification with existing data. The beta version of the database is freely available at http://www.xrzhanglab.com/msfragdb/.
Quantitative Methods, Molecular Networks
What problem does this paper attempt to address?
This paper addresses the main bottleneck in metabolomics research: metabolite structure identification. Specifically, the authors point out two key problems with current mass spectrometry databases and search algorithms:

1. **Low coverage of chemical component records in databases**: Although existing mass spectrometry databases such as METLIN and the Human Metabolome Database (HMDB) provide a large amount of metabolite information, their coverage of chemical components is limited, so many unknown or unrecorded metabolites cannot be identified accurately.
2. **Significant variation in MS/MS spectra across instruments and parameter settings**: Differences in experimental equipment and parameter settings can change MS/MS spectra substantially, which makes it difficult for matching algorithms based on spectral similarity to identify metabolites accurately.

To overcome these problems, the authors propose a molecule-fragment-centered method for identifying metabolites. The core idea is to regard the molecule fragments of metabolites as basic building blocks and to use the relatively consistent signatures of these fragments in MS/MS spectra to build a fragment-centered database (MSFragDB), which is searched with an intensity-free algorithm. In this way, even if a metabolite is not in the database, its possible taxonomy can be inferred from its fragment features.

In addition, the authors propose the concept of a molecule fragment network (MFN), a graph structure that encodes the relationships between molecule fragments in order to find closely related motifs that indicate specific chemical structures. With the help of the MFN, the correct structure can be estimated even when the MS/MS spectrum suffers from precursor contamination.

Overall, the method aims to improve the performance of metabolite structure identification using existing data, especially when the queried compound is missing from the database or the spectrum is contaminated.
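The two ideas summarized above, intensity-free searching and a fragment graph, can be illustrated with a small sketch. This is not the authors' MSFragDB implementation, whose scoring and graph construction are not described here. It is a minimal illustration under stated assumptions: the toy database `TOY_DB`, the function names, the 0.01 m/z tolerance, the "fraction of matched query peaks" score, and the co-occurrence edge rule are all hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy database: metabolite name -> fragment m/z values.
# The names and values are invented for illustration, not taken from HMDB.
TOY_DB = {
    "metabolite_A": [57.03, 85.03, 127.04],
    "metabolite_B": [57.03, 99.04, 141.05],
    "metabolite_C": [71.05, 85.03, 127.04, 155.07],
}


def rank_candidates(query_mz, db, tol=0.01):
    """Rank metabolites by the fraction of query peaks they explain.

    Only m/z positions are used; peak intensities are ignored, which is
    what "intensity-free" is assumed to mean in this sketch.
    """
    scores = {}
    for name, fragments in db.items():
        matched = sum(
            1 for q in query_mz
            if any(abs(q - f) <= tol for f in fragments)
        )
        if matched:
            scores[name] = matched / len(query_mz)
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def build_fragment_network(db):
    """Build a co-occurrence graph over fragments.

    Each edge (a, b) joins two fragments observed in the same metabolite;
    the edge weight counts how many metabolites share that pair. This edge
    rule is an assumption, not the construction used in the paper.
    """
    edges = defaultdict(int)
    for fragments in db.values():
        for a, b in combinations(sorted(set(fragments)), 2):
            edges[(a, b)] += 1
    return dict(edges)


if __name__ == "__main__":
    query = [57.031, 85.029, 127.041]  # peak positions only, no intensities
    for name, score in rank_candidates(query, TOY_DB):
        print(f"{name}: {score:.2f}")
    network = build_fragment_network(TOY_DB)
    print(network[(85.03, 127.04)])
```

With this toy data, the query ranks `metabolite_A` first, then `metabolite_C`, then `metabolite_B`, and the fragment pair (85.03, 127.04) gets weight 2 because it occurs in two metabolites. A recurring pair of this kind is the sort of motif a fragment graph could surface even when no single database record matches the full spectrum.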