QSAR Modelling Study of the Bioconcentration Factor and Toxicity of Organic Compounds to Aquatic Organisms Using Machine Learning and Ensemble Methods.
Haixin Ai, Xuewei Wu, Li Zhang, Mengyuan Qi, Ying Zhao, Qi Zhao, Jian Zhao, Hongsheng Liu
DOI: https://doi.org/10.1016/j.ecoenv.2019.04.035
IF: 7.129
2019-01-01
Ecotoxicology and Environmental Safety
Abstract: Bioconcentration factors and median lethal concentrations (LC50s) are important when assessing the risks that organic pollutants pose to aquatic ecosystems. Various quantitative structure-activity relationship (QSAR) models have been developed to predict bioconcentration factors and classify acute toxicity. In this study, we developed a regression model using the Recursive Feature Elimination (RFE) method combined with the Support Vector Machine (SVM) algorithm, with 2D molecular descriptors calculated for a dataset of 450 diverse chemicals. For classification, we calculated 12 molecular fingerprints for a dataset of 400 diverse chemicals and built three ensemble models using three machine learning algorithms. The R² and R²pred of the regression model were 0.860 and 0.757, respectively. Other validation parameters indicated that the regression model made good predictions and could predict a new set of compounds while meeting the standards set by Golbraikh, Tropsha, and Roy. Among the classification models, the ensemble-SVM model gave an overall accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of 92.2%, 95.1%, 86.0%, and 0.965, respectively, in five-fold cross-validation, and of 87.3%, 92.6%, 76.0%, and 0.940, respectively, in external validation. These parameters indicated that our ensemble-SVM model was more stable and gave more accurate predictions than previous models. The model could therefore be used to predict aquatic toxicity and assess risks posed to aquatic ecosystems. From the predictions made by the two types of models, we identified several structures most relevant to acute aquatic toxicity; this information may be useful for aquatic toxicology experiments and aquatic system management.
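The four classification metrics quoted in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names (`classification_metrics`, `roc_auc`), the label convention (1 = toxic, 0 = non-toxic), and the toy data are assumptions made only for illustration, and both classes are assumed to be present so that no denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Assumed convention (hypothetical): 1 = toxic (positive), 0 = non-toxic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # fraction of positives correctly identified
    specificity = tn / (tn + fp)  # fraction of negatives correctly identified
    return accuracy, sensitivity, specificity


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (pairwise definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]

acc, sens, spec = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(acc, sens, spec, auc)
```

Sensitivity and specificity are reported separately because accuracy alone can hide poor performance on the smaller class, which is one reading of the gap between 92.6% sensitivity and 76.0% specificity in the external validation.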