Understanding the Aquatic Toxicity of Pesticide: Structure-Activity Relationship and Molecular Descriptors to Distinguish the Ratings of Toxicity

Gaoxue Wang, Yan Li, Xiaolin Liu, Yonghua Wang
DOI: https://doi.org/10.1002/qsar.200960050
2009-01-01
QSAR & Combinatorial Science
Abstract: The purpose of this work is to develop robust, interpretable structure-activity relationship (SAR) models for assessing the aquatic toxicity of pesticides. A data set of 1600 chemicals, comprising 533 nontoxic (C0), 287 slightly toxic (C1), 329 moderately toxic (C2), 231 highly toxic (C3), and 220 very highly toxic (C4) compounds to aquatic organisms, was collected in this work. Their chemical structures were encoded into 196 molecular descriptors, including 2D topological and electrotopological state variables as well as the MlogP and AlogP parameters. Two variable selection techniques, i.e., the Stepwise procedure and Genetic Algorithms (GA), coupled with linear discriminant analysis (LDA), were used to obtain stable and thoroughly validated QSARs. Our results reveal that AlogP alone is capable of classifying the C0 versus C4 compounds with an accuracy rate of 70.4%, but performs poorly between the other groups, while MlogP does not show any pronounced correlation with aquatic toxicity for any of the groups. Using all the theoretical descriptors, the GA-LDA models for the C(0,4), C(1,3), C(1,4), and C(2,4) classifications are acceptable, with external prediction accuracies ranging from 66.3% to 80.6%. The selected descriptors, which account for molecular size, electrotopological state, and hydrophobicity, were found to be crucial to modeling aquatic toxicity. The robustness and predictive performance of the proposed models were verified using both internal (leave-one-out cross-validation, Y-scrambling) and external (randomly selected) statistical validations. Our results demonstrate that the Genetic Algorithms have a substantial advantage over the Stepwise procedure, generating more reliable models while using far fewer descriptors for all the data sets.
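The internal validation steps named in the abstract (leave-one-out cross-validation and Y-scrambling around an LDA classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single hydrophobicity-like descriptor, omits the GA and Stepwise descriptor selection entirely, and runs on synthetic values generated purely for illustration. The function names (`fit_lda_1d`, `predict_lda_1d`, `loo_accuracy`, `y_scrambling`) and all numbers are assumptions, not taken from the paper.

```python
import math
import random


def fit_lda_1d(xs, ys):
    """Fit a one-descriptor LDA: class means, class priors, pooled variance."""
    classes = sorted(set(ys))
    n = len(xs)
    means, priors = {}, {}
    pooled_ss = 0.0
    for c in classes:
        vals = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(vals) / len(vals)
        priors[c] = len(vals) / n
        pooled_ss += sum((v - means[c]) ** 2 for v in vals)
    var = pooled_ss / (n - len(classes))  # shared (pooled) variance
    return means, priors, var


def predict_lda_1d(model, x):
    """Assign x to the class with the largest linear discriminant score."""
    means, priors, var = model

    def score(c):
        return x * means[c] / var - means[c] ** 2 / (2 * var) + math.log(priors[c])

    return max(means, key=score)


def loo_accuracy(xs, ys):
    """Leave-one-out cross-validation: refit without sample i, then predict it."""
    hits = 0
    for i in range(len(xs)):
        model = fit_lda_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        hits += predict_lda_1d(model, xs[i]) == ys[i]
    return hits / len(xs)


def y_scrambling(xs, ys, n_rounds=20, seed=0):
    """Mean LOO accuracy after shuffling the labels (a chance-level baseline)."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_rounds):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        accs.append(loo_accuracy(xs, shuffled))
    return sum(accs) / len(accs)


# Synthetic, illustrative data only (NOT values from the paper):
# a logP-like descriptor drawn from two overlapping normal distributions.
rng = random.Random(42)
xs = [rng.gauss(1.0, 1.0) for _ in range(30)] + [rng.gauss(5.0, 1.0) for _ in range(30)]
ys = ["C0"] * 30 + ["C4"] * 30

real_acc = loo_accuracy(xs, ys)
scrambled_acc = y_scrambling(xs, ys)
print(f"LOO accuracy: {real_acc:.3f}")
print(f"Mean Y-scrambled LOO accuracy: {scrambled_acc:.3f}")
```

A model that has learned a genuine structure-activity relationship should score clearly higher on the true labels than on the scrambled ones; when the two accuracies are similar, the apparent fit is likely due to chance correlation.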