Extraction of information about the molecule structure directly from GC-MS data

Dmitry D. Matyushin, Anastasiya Yu. Sholokhova
DOI: https://doi.org/10.17308/sorpchrom.2023.23/11317
2023-07-17
Сорбционные и хроматографические процессы
Abstract: Gas chromatography–mass spectrometry (GC-MS) is an important method of chemical analysis. GC-MS can be used for non-target chemical analysis and preliminary screening of completely unknown compounds. Electron ionization mass spectrometry is commonly used in GC-MS. Some information can be extracted directly from GC-MS data using machine learning methods. In several previous works, machine learning models extract information about the presence or absence of given substructures in a molecule directly from the electron ionization mass spectrum. Additional data such as molecular weight and retention index are rarely used together with the mass spectrum as input features of such models, and no systematic comparison of how much such data improve prediction accuracy has previously been conducted. In this work, gradient boosting was used to predict the presence or absence of given substructures in a molecule. The following substructures were considered: aromatic ring, 5-membered aromatic ring, 6-membered aromatic ring without heteroatoms (benzene ring), nitrogen-containing aromatic ring, primary, secondary, and tertiary amino groups, nitrile, hydroxyl, carbonyl, methoxy, methyl, and carboxyl groups. Three types of additional features were used: molecular weight and neutral loss spectra (the neutral loss spectrum can be computed once the molecular weight is known), retention index for the non-polar stationary phase, and retention index for the polar stationary phase. A total of 8 feature sets were considered. In most cases, the molecular weight and neutral loss spectrum considerably improve the accuracy. Retention indices increase the accuracy further. The effect of retention indices is greatest for polar functional groups such as carbonyl and hydroxyl. The best accuracy is achieved when retention indices for both stationary phases are used.
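The feature construction described above can be illustrated with a short sketch. It assumes the common definition of a neutral loss (molecular weight minus fragment m/z, with nominal integer masses and singly charged ions) and assumes that the 8 feature sets are all combinations of the three optional feature groups added to the mass spectrum (2^3 = 8). The abstract states neither detail explicitly. The function names, group labels, and toy peak list are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def neutral_loss_spectrum(peaks, molecular_weight):
    """Convert (m/z, intensity) peaks into a {neutral loss: intensity} mapping.

    Assumption: nominal (integer) masses and singly charged fragments, so the
    neutral loss of a peak is molecular_weight - m/z.
    """
    losses = {}
    for mz, intensity in peaks:
        loss = molecular_weight - mz
        if loss >= 0:  # peaks above the molecular weight (e.g. isotope peaks) are skipped
            losses[loss] = max(losses.get(loss, 0.0), intensity)
    return dict(sorted(losses.items()))


# Labels for the three optional feature groups named in the abstract (hypothetical names).
OPTIONAL_GROUPS = ("mw_and_neutral_loss", "ri_nonpolar", "ri_polar")


def feature_sets():
    """Every subset of the optional groups, each combined with the mass spectrum."""
    sets = []
    for k in range(len(OPTIONAL_GROUPS) + 1):
        for combo in combinations(OPTIONAL_GROUPS, k):
            sets.append(("mass_spectrum",) + combo)
    return sets


# Toy spectrum with made-up values, for illustration only.
toy_peaks = [(100, 30.0), (85, 100.0), (57, 45.0)]
print(neutral_loss_spectrum(toy_peaks, molecular_weight=100))
# {0: 30.0, 15: 100.0, 43: 45.0}
print(len(feature_sets()))
# 8
```

In this sketch, the peak at m/z 85 becomes a neutral loss of 15. This is why the molecular weight alone is enough to derive the second spectrum.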
The best prediction accuracy was achieved for the benzene ring and the aromatic ring; the worst (but still high) accuracy was observed for the secondary amino group. The achieved accuracy was compared with previously published results. In addition to the classification tasks, regression tasks were considered. Gradient boosting models that predict the number of aromatic atoms, methyl groups, and benzene rings were developed. The use of additional features considerably improves the accuracy in this case as well. Finally, the regression models underestimate the number of occurrences when that number is high.
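One simple way to check for the kind of underestimation mentioned above is to group the signed error of a count regressor by the true count. The helper below is a minimal sketch of such a diagnostic. It is not the evaluation procedure used in the paper, and the function name is hypothetical.

```python
from collections import defaultdict


def mean_signed_error_by_count(true_counts, predicted_counts):
    """Return {true count: mean(predicted - true)} over all examples.

    A negative value for a given true count means that the model, on average,
    predicts fewer occurrences than are actually present.
    """
    errors = defaultdict(list)
    for true, predicted in zip(true_counts, predicted_counts):
        errors[true].append(predicted - true)
    return {count: sum(errs) / len(errs) for count, errs in sorted(errors.items())}
```

If the values become increasingly negative as the true count grows, the model shows the behaviour described in the abstract.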