Support vector machine: classifying and predicting mutagenicity of complex mixtures based on pollution profiles.
Weiwei Zheng, Dajun Tian, Xia Wang, Weidong Tian, Hao Zhang, Songhui Jiang, Gengsheng He, Yuxin Zheng, Weidong Qu
DOI: https://doi.org/10.1016/j.tox.2013.01.016
IF: 4.571
2013-01-01
Toxicology
Abstract: Powerful, robust in silico approaches offer great promise for classifying and predicting the biological effects of complex mixtures and for identifying the constituents of greatest concern. Support vector machine (SVM) methods can handle high-dimensional data and small sample sizes, and can examine multiple interrelationships among samples. In this work, we applied SVM methods to examine the pollution profiles and mutagenicity of 60 water samples obtained from 6 cities in China during 2006–2011. Pollutant profiles were characterized in water extracts by gas chromatography–mass spectrometry (GC/MS), and mutagenicity was examined by Ames assays. We encoded the GC/MS peaks in the mixtures as feature vectors and used 48 samples as the training set, reserving 12 samples as the test set. SVM classification and regression models were constructed from whole pollution profiles, and compounds were ranked by their correlation with mutagenicity. Both classification and prediction performance were evaluated. The SVM model based on whole pollution profiles showed lower performance (sensitivity, specificity, and accuracy were 69.5–70.7%, 70.6–73.2%, and 69.9–72.1%, respectively, with a correlation coefficient of 0.55–0.59) than one based on the compounds most strongly associated with mutagenicity. An SVM model with the top 10 compounds had the highest performance (sensitivity, specificity, and accuracy were 89.8–90.3%, 90.1–92.1%, and 90.1–91.3%, respectively, with a correlation coefficient of 0.80–0.82), with negligible decreases in performance from the training set to the test set. SVM can be a powerful, robust classifier of the relationship between pollutants and mutagenicity in complex real-world mixtures. The top 10 compounds make the greatest contribution to mutagenicity and deserve further study to identify these constituents.
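The ranking and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not say which correlation measure was used for ranking or for reporting, so the sketch assumes Pearson correlation for ranking features and the Matthews correlation coefficient for the reported "correlation coefficient". The function names (`pearson`, `rank_features`, `binary_metrics`) and the toy data are hypothetical, and the SVM training step itself is omitted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(X, y, k):
    """Indices of the k columns of X with the largest |correlation| to y.

    Assumption: ranking by absolute Pearson correlation; the paper's exact
    ranking criterion is not given in the abstract.
    """
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and Matthews correlation (labels 0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs),
        # Assumption: "correlation coefficient" is read as Matthews correlation.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy data (hypothetical): 4 samples x 3 peak intensities, 1 = mutagenic.
X = [[5, 1, 0], [4, 0, 1], [1, 1, 3], [0, 0, 5]]
y = [1, 1, 0, 0]
top = rank_features(X, y, k=2)

# Toy predictions (hypothetical) to show the four reported measures.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

In a workflow like the one described, the selected column indices would be used to subset the 48 training and 12 test samples before fitting the classifier, and the metrics would be computed separately on each set to check for a drop in performance.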