Exploring the Chemical Subspace of RPLC: a Data Driven Approach

Denice van Herwerden, Alexandros Nikolopoulos, Leon Barron, Jake O'Brien, Bob Pirok, Kevin Thomas, Saer Samanipour
DOI: https://doi.org/10.26434/chemrxiv-2023-bdwh0-v3
2024-01-23
Abstract: The chemical space comprises a vast number of possible structures, of which an unknown portion makes up the human and environmental exposome. Such samples are frequently analyzed using non-targeted analysis via liquid chromatography (LC) coupled to high-resolution mass spectrometry, often employing a reversed-phase (RP) column. However, prior to analysis, the contents of these samples are unknown and could comprise thousands of known and unknown chemical constituents. Moreover, it is unknown which part of the chemical space is sufficiently retained and eluted using RPLC. Therefore, we present a generic framework that uses a data-driven approach to predict whether molecules fall "inside", "maybe" inside, or "outside" of the RPLC subspace. Firstly, three retention index random forest (RF) regression models were constructed, which showed that molecular fingerprints are able to predict RPLC retention behavior. Secondly, these models were used to set up the dataset for building an RPLC RF classification model. The RPLC classification model was able to correctly predict whether a chemical belonged to the RPLC subspace with an accuracy of 92% for the testing set. Finally, applying this model to the 91737 small molecules (i.e., ≤1000 Da) in NORMAN SusDat showed that 19.1% fall "outside" of the RPLC subspace. Knowing which chemicals are outside of the RPLC subspace can assist in reducing potential candidates for library searching and avoid screening for chemicals that will not be present in RPLC data.
Chemistry
What problem does this paper attempt to address?
This paper addresses a limitation of reversed-phase liquid chromatography (RPLC) in non-targeted analysis: it is not known in advance which part of the chemical space is sufficiently retained and eluted on an RP column. The chemical space contains a vast number of possible structures, an unknown portion of which makes up the human and environmental exposome, and the composition of a sample is unknown before analysis and may include thousands of known and unknown chemicals. The researchers propose a data-driven framework that predicts whether a molecule falls "inside", "maybe" inside, or "outside" of the RPLC subspace. First, they constructed three retention index random forest (RF) regression models, which showed that molecular fingerprints can predict RPLC retention behavior. These models were then used to set up the dataset for an RPLC RF classification model, which predicted whether a chemical belongs to the RPLC subspace with an accuracy of 92% on the test set. Finally, the model was applied to the 91,737 small molecules (≤1000 Da) in the NORMAN SusDat database, of which 19.1% were predicted to fall "outside" of the RPLC subspace. The method helps reduce the number of potential candidates in non-targeted analysis and suspect screening, lowers the chance of misidentification, saves computational time, and avoids screening for chemicals that will not be present in RPLC data. The study also presents molecular fingerprints as an alternative way to describe molecular structures that is easier to understand and compute than the descriptors used in traditional quantitative structure-retention relationship models.
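The candidate-reduction step described above can be illustrated with a minimal sketch. It is not the authors' code: the compound names and their predicted labels are made up, and the helper names (`filter_candidates`, `label_fractions`, `approx_count`) are hypothetical. It only assumes that a classifier has already assigned each suspect one of the three labels named in the abstract.

```python
from collections import Counter

# Hypothetical classifier output: (compound name, predicted label).
# The labels follow the abstract; the compounds and predictions are invented.
predictions = [
    ("compound_A", "inside"),
    ("compound_B", "outside"),
    ("compound_C", "maybe"),
    ("compound_D", "inside"),
    ("compound_E", "outside"),
]


def filter_candidates(predictions, drop=("outside",)):
    """Keep only suspects whose predicted label is not in `drop`."""
    return [name for name, label in predictions if label not in drop]


def label_fractions(predictions):
    """Return the fraction of suspects assigned to each label."""
    counts = Counter(label for _, label in predictions)
    total = len(predictions)
    return {label: count / total for label, count in counts.items()}


def approx_count(total, fraction):
    """Approximate number of items implied by a reported fraction."""
    return round(total * fraction)


kept = filter_candidates(predictions)
fractions = label_fractions(predictions)

# Rough count implied by the reported 19.1% of 91737 SusDat molecules.
# This is derived from the rounded percentage, not a figure from the paper.
outside_estimate = approx_count(91737, 0.191)

print(kept)
print(fractions)
print(outside_estimate)
```

In this sketch, "maybe" suspects are kept, which is the conservative choice for screening; passing `drop=("outside", "maybe")` would apply a stricter filter. How the paper itself treats the "maybe" class during library searching is not stated in the abstract.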