Applications of Machine Learning to In Silico Quantification of Chemicals without Analytical Standards
Dimitri Panagopoulos Abrahamsson, June-Soo Park, Randolph R. Singh, Marina Sirota, Tracey J. Woodruff
DOI: https://doi.org/10.1021/acs.jcim.9b01096
IF: 6.162
2020-05-07
Journal of Chemical Information and Modeling
Abstract: Non-targeted analysis provides a comprehensive approach to analyzing environmental and biological samples for nearly all chemicals present. One of the main shortcomings of current analytical methods and workflows is that they cannot provide quantitative information, which is an important obstacle to understanding environmental fate and human exposure. Herein, we present an in silico quantification method using machine learning for chemicals analyzed using electrospray ionization (ESI). We considered three data sets from different instrumental setups: (i) capillary electrophoresis electrospray ionization-mass spectrometry (CE-MS) in positive ionization mode (ESI+), (ii) liquid chromatography quadrupole time-of-flight mass spectrometry (LC-QTOF/MS) in ESI+, and (iii) LC-QTOF/MS in negative ionization mode (ESI−). We developed and applied two different machine-learning algorithms, a random forest (RF) and an artificial neural network (ANN), to predict the relative response factors (RRFs) of different chemicals based on their physicochemical properties. Chemical concentrations can then be calculated by dividing the measured abundance of a chemical, as peak area or peak height, by its corresponding RRF. We evaluated our models and tested their predictive power using 5-fold cross-validation (CV) and y-randomization. Both the RF and the ANN models showed great promise in predicting RRFs. However, the accuracy of the predictions was dependent on the data set composition and the experimental setup. For the CE-MS ESI+ data set, the best model predicted measured RRFs with a mean absolute error (MAE) of 0.19 log units and a cross-validation coefficient of determination (Q²) of 0.84 for the testing set. For the LC-QTOF/MS ESI+ data set, the best model predicted measured RRFs with an MAE of 0.32 and a Q² of 0.40. For the LC-QTOF/MS ESI− data set, the best model predicted measured RRFs with an MAE of 0.50 and a Q² of 0.20. Our findings suggest that machine-learning algorithms can be used for predicting concentrations of non-targeted chemicals with reasonable uncertainties, especially in ESI+, while the application to ESI− remains a more challenging problem.
Supporting Information: The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.9b01096. It comprises chemical names and physicochemical descriptors of the chemicals in the CE-MS ESI+ data set (XLSX); chemical names and physicochemical descriptors of the chemicals in the LC-QTOF/MS data sets (XLSX); and information on the design of the algorithms and the optimization of the hyperparameters (PDF).
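The quantification step and the two evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and example values are hypothetical, the RRF is assumed to be expressed in log10 units (the abstract reports errors in "log units" without stating the base), and Q² is computed with the common definition 1 − SS_res/SS_tot, which may differ in detail from the one used in the paper.

```python
def concentration_from_abundance(abundance, log10_rrf):
    """Concentration = measured abundance / RRF.

    `abundance` is a peak area or peak height; `log10_rrf` is the predicted
    relative response factor, assumed here to be on a log10 scale.
    """
    rrf = 10 ** log10_rrf
    return abundance / rrf


def mean_absolute_error(measured, predicted):
    """Mean absolute error between measured and predicted log RRFs."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def q_squared(measured, predicted):
    """Coefficient of determination on held-out predictions.

    Assumed definition: 1 - SS_res / SS_tot, where `predicted` holds the
    predictions made for samples left out of training in each CV fold.
    """
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1 - ss_res / ss_tot


def fold_error(log_error):
    """Multiplicative error in concentration implied by an error in log10 RRF."""
    return 10 ** log_error


if __name__ == "__main__":
    # Hypothetical example: peak area of 1000 and a predicted log10 RRF of 2.0.
    print(concentration_from_abundance(1000.0, 2.0))

    # Fold errors implied by the MAEs reported in the abstract,
    # under the log10 assumption stated above.
    for mae in (0.19, 0.32, 0.50):
        print(mae, round(fold_error(mae), 2))
```

Because the concentration is obtained by dividing by the RRF, an error of δ log10 units in the predicted RRF corresponds to a factor of 10^δ in the estimated concentration. This is why MAEs reported in log units can be read as typical fold errors.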
chemistry, multidisciplinary, medicinal, computer science, interdisciplinary applications, information systems