Prioritization of unknown features based on predicted toxicity categories

Viktoriia Turkina, Jelle T. Gringhuis, Sanne Boot, Annemieke Petrignani, Antonia Praetorius, Jake W. O'Brien, Garry Corthals, Saer Samanipour
DOI: https://doi.org/10.26434/chemrxiv-2024-h8kbq
2024-11-21
Abstract: Complex environmental samples contain a diverse array of known and unknown constituents. While Liquid Chromatography coupled with High-Resolution Mass Spectrometry (LC-HRMS) Non-Targeted Analysis (NTA) has emerged as an essential tool for the comprehensive study of such samples, the identification of individual constituents remains a significant challenge, primarily due to the vast number of detected features in each sample. To address this, prioritization strategies are frequently employed to narrow the focus to the most relevant features for further analysis. In this study, we developed a novel prioritization strategy that directly links fragmentation and chromatographic data to aquatic toxicity categories, bypassing the need for individual compound identification. Given that features are not always well-characterized through fragmentation, we created two models: (1) a Random Forest Classification (RFC) model, which classifies fish toxicity categories based on MS1, retention, and fragmentation data (expressed as cumulative neutral losses, CNLs) when fragmentation information is available, and (2) a Kernel Density Estimation (KDE) model that relies solely on retention time and MS1 data when fragmentation is absent. Both models demonstrated accuracy comparable to structure-based prediction methods. We further tested the models on a pesticide mixture in a tea extract measured by LC-HRMS, where the CNLs-based RFC model achieved 0.76 accuracy and the KDE model reached 0.61, showcasing their robust performance in real-world applications.
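To make the KDE fallback more concrete, the sketch below shows one generic way a kernel density estimate can be turned into a classifier over two features, retention time and precursor m/z. It is not the authors' implementation: the `KDEClassifier` class, the fixed per-feature bandwidths, the category labels `cat_1` and `cat_2`, and all numeric values are hypothetical and chosen only for illustration. Each class gets its own Gaussian KDE, and a feature is assigned to the class with the highest prior-weighted density.

```python
import math
from collections import defaultdict


def gaussian_kde_density(point, samples, bandwidths):
    """Mean of product-Gaussian kernels centred on each training sample."""
    total = 0.0
    for sample in samples:
        kernel = 1.0
        for x, xi, h in zip(point, sample, bandwidths):
            z = (x - xi) / h
            kernel *= math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))
        total += kernel
    return total / len(samples)


class KDEClassifier:
    """Hypothetical helper: one KDE per class, argmax of prior * density."""

    def __init__(self, bandwidths):
        self.bandwidths = bandwidths  # one bandwidth per feature (assumed fixed)
        self.samples = {}
        self.priors = {}

    def fit(self, X, y):
        by_class = defaultdict(list)
        for features, label in zip(X, y):
            by_class[label].append(features)
        self.samples = dict(by_class)
        self.priors = {c: len(v) / len(X) for c, v in by_class.items()}
        return self

    def predict(self, point):
        scores = {
            c: self.priors[c] * gaussian_kde_density(point, s, self.bandwidths)
            for c, s in self.samples.items()
        }
        return max(scores, key=scores.get)


# Made-up training features: (retention time in min, precursor m/z).
# The labels are placeholders and imply nothing about real toxicity trends.
X_train = [(2.1, 150.0), (2.5, 162.0), (3.0, 171.0),
           (9.8, 410.0), (10.4, 395.0), (11.0, 430.0)]
y_train = ["cat_1", "cat_1", "cat_1", "cat_2", "cat_2", "cat_2"]

clf = KDEClassifier(bandwidths=(1.0, 20.0)).fit(X_train, y_train)
pred_a = clf.predict((2.4, 158.0))
pred_b = clf.predict((10.1, 405.0))
```

In practice the bandwidths would need to be tuned (for example by cross-validation), and the RFC branch would replace this step whenever fragmentation data are available.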
Chemistry
What problem does this paper attempt to address?