Optimal selection of learning data for highly accurate QSAR prediction of chemical biodegradability: a machine learning-based approach

K. Takeda, K. Takeuchi, Y. Sakuratani, K. Kimbara
DOI: https://doi.org/10.1080/1062936X.2023.2251889
IF: 3.681
2023-09-08
SAR and QSAR in Environmental Research
Abstract: Prior to the manufacture of new chemicals, regulations mandate a thorough review of the chemicals under risk management. This review involves evaluating their effects on the environment and human health. To assess these effects, a review report that conforms to the OECD Test Guidelines must be submitted to the regulatory body. One of the essential components of the report is an assessment of the biodegradability of chemicals in the environment. In addition to conventional methods, quantitative structure-activity relationship (QSAR) models have been developed to predict the properties of chemicals based on their structural features. Although a larger number of chemicals in the learning set may enhance the prediction accuracy, it may also decrease accuracy because chemicals with different structural features and properties are mixed together. To improve the prediction performance, it is recommended to use only the data appropriate for biodegradability prediction as a training set. In this study, we propose a novel approach for the optimal selection of a training set that enables highly accurate prediction of the biodegradability of chemicals by QSAR. Our findings indicate that the proposed method effectively reduces the root mean squared error and improves the prediction accuracy.
environmental sciences; toxicology; computer science, interdisciplinary applications; chemistry, multidisciplinary; mathematical & computational biology
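
The abstract does not describe the selection procedure itself. As a rough illustration of why restricting the training set can lower the root mean squared error, here is a minimal sketch that assumes a simple nearest-neighbour selection in a hypothetical two-dimensional descriptor space. The function names, the selection rule, and all toy values are invented for illustration. They are not the authors' method or data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def select_training_subset(X_train, x_query, k):
    """Indices of the k training chemicals closest to the query.

    Assumption: Euclidean distance on descriptor vectors stands in for
    structural similarity; the paper may use a different criterion.
    """
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], x_query))
    return order[:k]

def mean_prediction(y_train, indices):
    """Trivial stand-in model: average the selected training values."""
    return sum(y_train[i] for i in indices) / len(indices)

# Toy data (hypothetical): two structurally distinct groups of chemicals
# with very different biodegradation percentages.
X_train = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y_train = [80, 82, 78, 10, 12, 8]

x_query = [0.5, 0.5]   # resembles the first group
y_query = 80           # assumed observed value for the query

# Prediction from the full, mixed training set.
full_pred = mean_prediction(y_train, range(len(y_train)))

# Prediction from only the three most similar chemicals.
subset = select_training_subset(X_train, x_query, k=3)
subset_pred = mean_prediction(y_train, subset)

full_error = rmse([y_query], [full_pred])
subset_error = rmse([y_query], [subset_pred])
```

In this toy setting the mixed training set pulls the prediction toward the mean of both groups, while the similarity-restricted subset does not, so `subset_error` comes out smaller than `full_error`. A real QSAR workflow would replace the averaging step with a fitted model and evaluate the error over a held-out test set.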