Explainable machine learning models for predicting the acute toxicity of pesticides to sheepshead minnow (Cyprinodon variegatus)

Ting Sun, Chongzhi Wei, Yang Liu, Yueying Ren
DOI: https://doi.org/10.1016/j.scitotenv.2024.177399
IF: 9.8
2024-11-24
The Science of The Total Environment
Abstract: A quantitative structure–activity relationship (QSAR) study was conducted on 313 pesticides to predict their acute toxicity to sheepshead minnow (Cyprinodon variegatus) using DRAGON descriptors. The elements required for a reliable model were considered carefully, with full attention to the OECD (Organisation for Economic Co-operation and Development) principles for the regulatory acceptability of QSAR models throughout model construction and assessment. Nine variables were selected by forward stepwise regression and used as inputs to construct both linear and nonlinear models. The resulting models were validated internally and externally. In general, the machine learning-based methods, namely support vector machine (SVM), random forest (RF), and projection pursuit regression (PPR), perform better than the multiple linear regression (MLR) model. The statistical results (R² = 0.682–0.933, Q²_LOO = 0.604–0.659, Q²_F1 = 0.740–0.796, CCC = 0.861–0.882) of the developed models show that they are robust, reliable, reproducible, accurate and predictive. Of these, the RF model performs best, giving a predictive correlation coefficient (Q²) of 0.814, a root mean squared error (RMSE) of 0.658 and a mean absolute error (MAE) of 0.534 for the test set. The RF model (as well as the SVM and PPR models) was visualized and explained using SHapley Additive exPlanations (SHAP) analysis to enhance its transparency and credibility. In addition, the applicability domain (AD) of the RF model was characterized with a Williams plot, and the tree manifold approximation and projection (TMAP) technique was used to illustrate the similarity and diversity of the entire data space and to assist in the analysis of outliers. Activity cliff detection was investigated using Arithmetic Residuals in K-groups Analysis (ARKA) descriptors.
No pesticide was identified as an activity cliff in the training set or as a potential prediction cliff in the test set. The RF model therefore fulfills each of the OECD principles for the regulatory acceptability of QSAR models. This work will aid the in silico QSAR prediction of the acute toxicity of untested and new toxic pesticides to sheepshead minnow (Cyprinodon variegatus), and the approach can also be extended to other studies.
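The abstract reports Q²_F1, CCC, RMSE and MAE without giving their formulas. The sketch below shows how these external-validation statistics are commonly computed in the QSAR literature; it is an illustration under assumed definitions, not the authors' code. The function name `external_validation_metrics` and its argument names are hypothetical, and the abstract does not state which Q² variant yields the reported test-set value of 0.814.

```python
import math


def external_validation_metrics(y_train, y_test, y_pred):
    """Return Q2_F1, CCC, RMSE and MAE for an external test set.

    Assumed definitions (not taken from the paper):
      Q2_F1 = 1 - sum((y - yhat)^2) / sum((y - mean(y_train))^2)
      CCC   = Lin's concordance correlation coefficient
    """
    n = len(y_test)
    if n == 0 or n != len(y_pred) or len(y_train) == 0:
        raise ValueError("inputs must be non-empty and test/pred lengths must match")

    # Q2_F1 compares the prediction error with the error of always
    # predicting the training-set mean.
    train_mean = sum(y_train) / len(y_train)
    residuals = [obs - pred for obs, pred in zip(y_test, y_pred)]
    press = sum(r * r for r in residuals)
    ss_ref = sum((obs - train_mean) ** 2 for obs in y_test)
    q2_f1 = 1.0 - press / ss_ref

    # CCC penalises both scatter and systematic shift between
    # observed and predicted values.
    mean_obs = sum(y_test) / n
    mean_pred = sum(y_pred) / n
    cross = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_test, y_pred))
    ss_obs = sum((o - mean_obs) ** 2 for o in y_test)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    ccc = 2.0 * cross / (ss_obs + ss_pred + n * (mean_obs - mean_pred) ** 2)

    rmse = math.sqrt(press / n)
    mae = sum(abs(r) for r in residuals) / n
    return {"Q2_F1": q2_f1, "CCC": ccc, "RMSE": rmse, "MAE": mae}
```

Because Q²_F1 uses the training-set mean as its reference, it needs the training responses as well as the test-set observations and predictions, whereas CCC, RMSE and MAE depend only on the test set.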
environmental sciences