Predicting the Activity of Unidentified Chemicals in Complementary Bioassays from the HRMS Data to Pinpoint Potential Endocrine Disruptors

Ida Rahu, Meelis Kull, Anneli Kruve
DOI: https://doi.org/10.1021/acs.jcim.3c02050
IF: 6.162
2024-03-24
Journal of Chemical Information and Modeling
Abstract: The majority of chemicals detected via nontarget liquid chromatography high-resolution mass spectrometry (HRMS) in environmental samples remain unidentified, challenging the capability of existing machine learning models to pinpoint potential endocrine disruptors (EDs). Here, we predict the activity of unidentified chemicals across 12 bioassays related to EDs within the Tox21 10K dataset. Single- and multi-output models, utilizing various machine learning algorithms and molecular fingerprint features as an input, were trained for this purpose. To evaluate the models under near real-world conditions, Monte Carlo sampling was implemented for the first time. This technique enables the use of probabilistic fingerprint features derived from the experimental HRMS data with SIRIUS+CSI:FingerID as an input for models trained on true binary fingerprint features. Depending on the bioassay, the lowest false-positive rate at 90% recall ranged from 0.251 (sr.mmp, mitochondrial membrane potential) to 0.824 (nr.ar, androgen receptor), which is consistent with the trends observed in the models' performances submitted for the Tox21 Data Challenge. These findings underscore the informativeness of fingerprint features that can be compiled from HRMS in predicting the endocrine-disrupting activity. Moreover, an in-depth SHapley Additive exPlanations analysis unveiled the models' ability to pinpoint structural patterns linked to the modes of action of active chemicals. Despite the superior performance of the single-output models compared to that of the multi-output models, the latter's potential cannot be disregarded for similar tasks in the field of in silico toxicology. This study presents a significant advancement in identifying potentially toxic chemicals within complex mixtures without unambiguous identification and effectively reducing the workload for postprocessing by up to 75% in nontarget HRMS.
chemistry, multidisciplinary, medicinal, computer science, interdisciplinary applications, information systems
What problem does this paper attempt to address?
This paper addresses the problem of predicting the activity of unidentified chemicals in 12 complementary bioassays from high-resolution mass spectrometry (HRMS) data, in order to pinpoint potential endocrine disruptors (EDs). Specifically, the paper focuses on the following key points:

1. **Background problems**:
   - Most chemicals detected in environmental samples remain unidentified, which makes it difficult for existing machine learning models to flag potential endocrine disruptors.
   - Traditional methods usually require known chemical structures as input, but in real environmental samples the structural information for many chemicals is missing.
2. **Research objectives**:
   - Train single-output and multi-output models on molecular fingerprint features, of the kind that can be derived from HRMS data, to predict the activity of unidentified chemicals in 12 bioassays related to endocrine disruption.
   - Evaluate these models under near real-world conditions, in particular when the input consists of probabilistic fingerprint features.
3. **Methods and techniques**:
   - Build single-output models with a variety of machine learning algorithms (such as linear discriminant analysis, logistic regression, naive Bayes, k-nearest neighbors, support vector machines, decision trees, random forests, and gradient boosting).
   - Build multi-output models based on deep neural networks (DNNs) that predict activity in multiple bioassays simultaneously.
   - Introduce Monte Carlo sampling to convert the probabilistic fingerprint features extracted from experimental HRMS data (with SIRIUS+CSI:FingerID) into binary fingerprint features, so that models trained on true binary fingerprints can be applied to them.
4. **Main findings**:
   - The lowest false-positive rate (FPR) of the single-output models ranged from 0.196 to 0.670 across bioassays. The abstract reports a range of 0.251 (sr.mmp) to 0.824 (nr.ar) at 90% recall, so predictive performance varies strongly from assay to assay.
   - Although the multi-output models performed worse than the single-output models, their potential should not be dismissed for similar tasks in in silico toxicology.
   - SHapley Additive exPlanations (SHAP) analysis shows that the models can pick up structural patterns linked to the modes of action of active chemicals.
5. **Significance**:
   - The approach makes it possible to flag potentially toxic chemicals in complex mixtures without unambiguously identifying each one, reducing the postprocessing workload in nontarget HRMS by up to 75%.
   - Because it relies on fingerprint features generated from HRMS data, the method helps fill gaps in existing toxicity data, reduces reliance on animal testing, and eases the time and resource constraints of traditional toxicity testing.

In conclusion, the paper improves the ability to prioritize potential endocrine disruptors in complex mixtures and offers new tools for environmental monitoring and risk assessment.
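The Monte Carlo step and the "FPR at 90% recall" metric described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `predict_proba` callable (a stand-in for any classifier trained on binary fingerprints), the number of samples, and the choice to average the predictions over the sampled fingerprints are all assumptions made for illustration.

```python
import numpy as np


def monte_carlo_predict(prob_fp, predict_proba, n_samples=100, seed=0):
    """Average a model's predicted activity over binary fingerprints
    sampled from one probabilistic fingerprint (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    prob_fp = np.asarray(prob_fp, dtype=float)
    # Draw each bit independently as Bernoulli(p_bit): a bit is set
    # when a uniform draw in [0, 1) falls below its probability.
    draws = rng.random((n_samples, prob_fp.size))
    samples = (draws < prob_fp).astype(np.int8)
    # predict_proba is assumed to map an (n, n_bits) binary array to
    # n activity probabilities; averaging them is an assumption.
    return float(np.mean(predict_proba(samples)))


def fpr_at_recall(y_true, scores, target_recall=0.9):
    """Lowest false-positive rate among all score thresholds whose
    recall is at least target_recall."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    best = 1.0  # predicting everything active always reaches recall 1
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        recall = (predicted & y_true).sum() / y_true.sum()
        if recall >= target_recall:
            fpr = (predicted & ~y_true).sum() / (~y_true).sum()
            best = min(best, float(fpr))
    return best
```

In use, `monte_carlo_predict` would be called once per detected feature with the fingerprint probabilities predicted for it, and `fpr_at_recall` would be computed per bioassay from the resulting scores and the known activity labels. An FPR of 0.25 at 90% recall means that roughly a quarter of the inactive chemicals would still be flagged while 90% of the active ones are retained.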