Comparative analysis of continuous similarity measures for compound identification in mass spectrometry-based metabolomics

Hunter Dlugas, Xiang Zhang, Seongho Kim
DOI: https://doi.org/10.26434/chemrxiv-2024-5fm7t
2024-07-19
Abstract: In mass spectrometry (MS)-based metabolomics, compound identification relies on Liquid Chromatography-MS (LC-MS) and Gas Chromatography-MS (GC-MS). The most popular and efficient approach for this purpose is the comparison of similarity scores between experimental spectra and reference spectra. Among the various single and composite similarity measures, the Cosine Correlation is widely favored due to its simplicity, efficiency, and effectiveness. Recently, the Shannon Entropy Correlation has shown superior performance over several other measures, including the Cosine Correlation, in LC-MS-based metabolomics, particularly concerning receiver operating characteristic (ROC) curves and false discovery rates. However, previous comparisons did not consider the weight factor transformation, which is critical for achieving higher accuracy with the Cosine Correlation. This study conducted a comparative analysis of the Cosine Correlation and Shannon Entropy Correlation, incorporating the weight factor transformation during preprocessing. Additionally, we developed a novel entropy correlation measure, the Tsallis Entropy Correlation, which offers greater versatility than the Shannon Entropy Correlation. Our results indicate that the weight factor transformation is essential for achieving higher accuracy in both LC-MS and GC-MS-based compound identification. While the Tsallis Entropy Correlation outperforms the Shannon Entropy Correlation, it is also more computationally expensive. The Cosine Correlation, when combined with the weight factor transformation, achieves the highest accuracy and the lowest computational expense, demonstrating its robustness and efficiency in MS-based compound identification.
Chemistry
What problem does this paper attempt to address?
This paper addresses how to identify compounds more accurately in mass spectrometry-based metabolomics. It compares three similarity measures, the Cosine Correlation, the Shannon Entropy Correlation, and a newly proposed Tsallis Entropy Correlation, on both liquid chromatography-mass spectrometry (LC-MS) and gas chromatography-mass spectrometry (GC-MS) data, with the weight factor transformation applied during preprocessing. The Cosine Correlation is widely used for its simplicity, efficiency, and effectiveness. The Shannon Entropy Correlation was recently reported to outperform it in LC-MS-based metabolomics, particularly in terms of ROC curves and false discovery rates, but those earlier comparisons omitted the weight factor transformation, which is critical to the accuracy of the Cosine Correlation. The paper finds that this transformation is essential for accurate compound identification in both LC-MS and GC-MS. It also examines the order of preprocessing steps and reports that identification performance improves for all three measures, especially when the weight factor transformation is applied after peak normalization and noise removal.
The Tsallis Entropy Correlation, which is more flexible than the Shannon Entropy Correlation, outperforms it in identification but at a higher computational cost. Overall, the Cosine Correlation combined with the weight factor transformation achieves the highest accuracy at the lowest computational cost, demonstrating its robustness and efficiency for MS-based compound identification.
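
To make the measures discussed above more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the two spectra have already been aligned onto a shared m/z grid, so each is a list of non-negative intensities of equal length. The weight factor transformation is written in its commonly used form, w_i = (m/z_i)^a * I_i^b; the function names and the default exponents `a` and `b` are illustrative assumptions and are not taken from the paper. The Shannon entropy similarity follows the widely used definition based on the entropy of the 1:1 merged spectrum. The Tsallis entropy function is the standard generalized entropy, which reduces to Shannon entropy as q approaches 1; the paper's exact Tsallis Entropy Correlation formula is not reproduced here.

```python
import math


def weight_transform(mz, intensity, a=0.5, b=1.0):
    """Weight factor transformation: w_i = mz_i**a * I_i**b.

    The exponents a and b are illustrative defaults (assumption),
    not values reported in the paper.
    """
    return [(m ** a) * (i ** b) for m, i in zip(mz, intensity)]


def cosine_similarity(x, y):
    """Cosine of the angle between two aligned intensity vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm > 0 else 0.0


def _normalize(x):
    """Scale intensities so they sum to 1 (assumes a non-empty spectrum)."""
    total = sum(x)
    return [v / total for v in x]


def shannon_entropy(p):
    """Shannon entropy (natural log) of a normalized spectrum."""
    return -sum(v * math.log(v) for v in p if v > 0)


def tsallis_entropy(p, q):
    """Tsallis entropy of order q; falls back to Shannon entropy at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(p)
    return (1.0 - sum(v ** q for v in p if v > 0)) / (q - 1.0)


def shannon_entropy_similarity(x, y):
    """1 - (2*S_mix - S_x - S_y) / ln(4), with S_mix from the 1:1 merged spectrum."""
    p, r = _normalize(x), _normalize(y)
    mixed = [(u + v) / 2 for u, v in zip(p, r)]
    gap = 2 * shannon_entropy(mixed) - shannon_entropy(p) - shannon_entropy(r)
    return 1.0 - gap / math.log(4)


if __name__ == "__main__":
    # Hypothetical aligned spectra on a shared m/z grid.
    mz = [50.0, 77.0, 105.0, 182.0]
    query = [10.0, 40.0, 100.0, 25.0]
    reference = [12.0, 35.0, 100.0, 30.0]

    wq = weight_transform(mz, query)
    wr = weight_transform(mz, reference)
    print("cosine (raw):     ", cosine_similarity(query, reference))
    print("cosine (weighted):", cosine_similarity(wq, wr))
    print("Shannon entropy:  ", shannon_entropy_similarity(wq, wr))
```

In this sketch the weight factor transformation is applied before any similarity is computed, so the same transformed vectors can be passed to either measure. Both similarities are 1 for identical spectra and 0 for spectra with no shared peaks.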