Prediction of Chemical Carcinogenicity by Machine Learning Approaches.

N. X. Tan, H. B. Rao, Z. R. Li, X. Y. Li
DOI: https://doi.org/10.1080/10629360902724085
IF: 3.681
2009-01-01
SAR and QSAR in Environmental Research
Abstract: In this paper we report a successful application of machine learning approaches to the prediction of chemical carcinogenicity. Two different approaches, namely a support vector machine (SVM) and an artificial neural network (ANN), were evaluated for predicting chemical carcinogenicity from molecular structure descriptors. A diverse set of 844 compounds, including 600 carcinogenic (CG+) and 244 noncarcinogenic (CG−) molecules, was used to estimate the accuracies of these approaches. The database was divided into two sets: the model construction set and the independent test set. Relevant molecular descriptors were selected by a hybrid feature selection method combining Fisher's score and Monte Carlo simulated annealing from a wide set of molecular descriptors, including physicochemical properties and constitutional, topological, and geometrical descriptors. The first model validation method was based on five-fold cross-validation, in which the model construction set is split into five subsets. The five-fold cross-validation was used to select descriptors and optimise the model parameters by maximising the averaged overall accuracy. The final SVM model gave an averaged prediction accuracy of 90.7% for CG+ compounds, 81.6% for CG− compounds and 88.1% for the overall accuracy, while the corresponding ANN model provided an averaged prediction accuracy of 86.1% for CG+ compounds, 79.3% for CG− compounds and 84.2% for the overall accuracy. These results indicate that the hybrid feature selection method is very efficient and the selected descriptors are truly relevant to the carcinogenicity of compounds. The second model validation method was a hold-out method: using the selected descriptors and the optimised model parameters, the classification model was built on the whole model construction set, and the independent test set was used to test its predictive ability.
On the independent test set, the SVM model gave a prediction accuracy of 87.6% for CG+ compounds, 79.1% for CG− compounds and 85.0% for the overall accuracy. The ANN model gave a prediction accuracy of 85.6% for CG+ compounds, 79.1% for CG− compounds and 83.6% for the overall accuracy. The results indicate that the built models are potentially useful for predicting the chemical carcinogenicity of untested compounds.
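
As a rough illustration of two quantities the abstract relies on, the sketch below computes a two-class Fisher score for a single descriptor and the per-class and overall accuracies used to report results. The abstract does not give the exact formulas, so the common form (mean difference squared over the sum of class variances) is an assumption here, and the function names `fisher_score` and `class_accuracies` are hypothetical. It does not reproduce the paper's simulated annealing step or its SVM and ANN models.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score of one descriptor.

    Assumed form (not taken from the paper):
        (mean_pos - mean_neg)^2 / (var_pos + var_neg)
    Labels are 1 for CG+ and 0 for CG-.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        # No within-class variation: perfectly separating if the means
        # differ, otherwise the descriptor carries no information.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def class_accuracies(y_true, y_pred):
    """Return (CG+ accuracy, CG- accuracy, overall accuracy) as fractions."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    acc_pos = sum(p == 1 for p in pos) / len(pos)
    acc_neg = sum(p == 0 for p in neg) / len(neg)
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return acc_pos, acc_neg, overall


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    descriptor = [1, 2, 3, 7, 8, 9]
    labels = [0, 0, 0, 1, 1, 1]
    print(fisher_score(descriptor, labels))
    print(class_accuracies([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1]))
```

Under this definition the overall accuracy is the class-size-weighted average of the CG+ and CG− accuracies, so on a set dominated by CG+ compounds it sits closer to the CG+ figure, which matches the pattern in the values reported above.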