Application of machine learning and deep learning methods for hydrated electron rate constant prediction
Shanshan Zheng, Wanqian Guo, Chao Li, Yongbin Sun, Qi Zhao, Hao Lu, Qishi Si, Huazhe Wang
DOI: https://doi.org/10.1016/j.envres.2023.115996
IF: 8.3
2023-04-01
Environmental Research
Abstract: Accurately determining the second-order rate constant with e<sub>aq</sub><sup>-</sup> (k<sub>eaq-</sub>) for organic compounds (OCs) is crucial in e<sub>aq</sub><sup>-</sup>-induced advanced reduction processes (ARPs). In this study, we collected 867 k<sub>eaq-</sub> values at different pHs from peer-reviewed publications and applied the machine learning (ML) algorithm XGBoost and the deep learning (DL) algorithm convolutional neural network (CNN) to predict k<sub>eaq-</sub>. Our results demonstrated that the CNN model with transfer learning and data augmentation (CNN-TL&DA) greatly improved the prediction results and overcame over-fitting. Furthermore, we compared the ML/DL modeling methods and found that CNN-TL&DA, which uses molecular images (MI), achieved the best overall performance (R<sup>2</sup><sub>test</sub> = 0.896, RMSE<sub>test</sub> = 0.362, MAE<sub>test</sub> = 0.261) when compared to the XGBoost algorithm combined with Mordred descriptors (MD) (R<sup>2</sup><sub>test</sub> = 0.692, RMSE<sub>test</sub> = 0.622, MAE<sub>test</sub> = 0.399) and with Morgan fingerprints (MF) (R<sup>2</sup><sub>test</sub> = 0.512, RMSE<sub>test</sub> = 0.783, MAE<sub>test</sub> = 0.520). Moreover, the interpretation of the MD-XGBoost and MF-XGBoost models using the SHAP method revealed the significance of MDs (e.g., molecular size, branching, electron distribution, polarizability, and bond types), MFs (e.g., aromatic carbon, carbonyl oxygen, nitrogen, and halogen) and environmental conditions (e.g., pH) that influence the k<sub>eaq-</sub> prediction. The interpretation of the 2D molecular image-CNN (MI-CNN) models using the Grad-CAM method showed that they correctly identified key functional groups, such as -CN, -NO<sub>2</sub>, and -X, that can increase the k<sub>eaq-</sub> values. Additionally, the MI-CNN model highlighted almost all electron-withdrawing groups and a small fraction of electron-donating groups when estimating k<sub>eaq-</sub>.
Overall, our results suggest that the CNN approach has smaller errors when compared to ML algorithms, making it a promising candidate for predicting other rate constants.
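The three test-set metrics quoted above (R<sup>2</sup><sub>test</sub>, RMSE<sub>test</sub>, MAE<sub>test</sub>) can be illustrated with a minimal sketch. It assumes the standard textbook definitions, which the abstract does not spell out. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the paper's dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for paired observed/predicted values.

    Assumption: standard definitions; the abstract does not state how the
    authors computed or scaled these metrics.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(r * r for r in residuals)             # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Hypothetical observed and predicted values, for illustration only.
observed = [8.1, 9.4, 10.2, 7.6]
predicted = [8.3, 9.1, 10.0, 7.9]
metrics = regression_metrics(observed, predicted)
```

A higher R<sup>2</sup> together with lower RMSE and MAE on held-out data is the sense in which the abstract ranks CNN-TL&DA above the two XGBoost models.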
environmental sciences, public, environmental & occupational health