Identification of DNA adduct formation of small molecules by molecular descriptors and machine learning methods

Hanbing Rao, Xianyin Zeng, Yanying Wang, Hua He, Feng Zhu, Zerong Li, Yuzong Chen
DOI: https://doi.org/10.1080/08927022.2011.616891
2012-01-01
Molecular Simulation
Abstract: In this study, we developed new computational DNA adduct prediction models using a significantly more diverse training data set of 217 DNA adducts and 1024 non-DNA adducts, and applying five machine learning methods: support vector machine (SVM), k-nearest neighbour, artificial neural networks, logistic regression and continuous kernel discrimination. The molecular descriptors used for DNA adduct prediction were selected from a pool of 548 descriptors by a multi-step hybrid feature selection method combining the Fisher score with Monte Carlo simulated annealing. Some of the selected descriptors are consistent with the structural and physicochemical properties reported to be important for DNA adduct formation. The y-scrambling method was used to test whether there is a chance correlation in the developed SVM model. Fivefold cross-validation of these machine learning methods gave prediction accuracies of 64.1-82.5% for DNA adducts and 95.1-97.6% for non-DNA adducts, and the prediction accuracies for the external test set were 78.2-100% for DNA adducts and 92.6-98.4% for non-DNA adducts. Our study suggests that the tested machine learning methods are potentially useful for DNA adduct identification.
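To illustrate the Fisher-score filtering step mentioned in the abstract, here is a minimal sketch of one common two-class form of the score: the squared difference of class means divided by the sum of class variances, computed per descriptor. The abstract does not give the exact formula or thresholds the authors used, so the function name `fisher_scores`, the toy data and the 0/1 label encoding are all assumptions made for illustration only.

```python
from statistics import mean, pvariance

def fisher_scores(X, y):
    """Return one two-class Fisher score per descriptor column.

    Assumed form: (mean_pos - mean_neg)^2 / (var_pos + var_neg).
    X is a list of rows (one row per molecule); y holds labels 1 (adduct)
    or 0 (non-adduct). This is an illustrative sketch, not the paper's code.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        num = (mean(pos) - mean(neg)) ** 2
        den = pvariance(pos) + pvariance(neg)
        if den > 0:
            scores.append(num / den)
        else:
            # Zero within-class variance: perfectly separated or constant column.
            scores.append(float("inf") if num > 0 else 0.0)
    return scores

# Hypothetical toy data: descriptor 0 separates the classes, descriptor 1 does not.
X = [[1.0, 5.0],
     [1.2, 1.0],
     [3.0, 4.0],
     [3.2, 2.0]]
y = [0, 0, 1, 1]

scores = fisher_scores(X, y)
# Descriptor indices ordered from most to least discriminative.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In a hybrid scheme like the one described, a filter of this kind would typically prune the descriptor pool before a more expensive search such as simulated annealing explores subsets of the survivors.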