Predicting Micropollutant Removal by Reverse Osmosis and Nanofiltration Membranes: Is Machine Learning Viable?

Nohyeong Jeong, Tai-heng Chung, Tiezheng Tong
DOI: https://doi.org/10.1021/acs.est.1c04041
2021-08-03
Abstract: Predictive models for micropollutant removal by membrane separation are highly desirable for the design and selection of appropriate membranes. While machine learning (ML) models have been applied for such purposes, their reliability might be compromised by data leakage due to inappropriate data splitting. More importantly, whether ML models can truly understand the mechanisms of membrane separation has not been revealed. In this study, we evaluate the capability of the XGBoost model to predict micropollutant removal efficiencies of reverse osmosis and nanofiltration membranes. Our results demonstrate that data leakage leads to falsely high prediction accuracy. By utilizing a model interpretation method based on cooperative game theory, we test the knowledge of XGBoost on the mechanisms of membrane separation by quantifying the contributions of input variables to the model predictions. We reveal that XGBoost possesses an adequate understanding of size exclusion, but its knowledge of electrostatic interactions and adsorption is limited. Our findings suggest that future work should focus more on avoiding data leakage and evaluating the mechanistic knowledge of ML models.
In addition, high-quality data from more diverse experimental conditions, as well as more informative variables, are needed to improve the accuracy of ML models for predicting membrane performance.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.est.1c04041.
Details on PCA and K-fold cross-validation; input variables; absolute errors of the model predictions with and without data leakage (with detailed data); absolute errors of the model predictions as a function of input variables; comparison of MAE of our study with the literature; predictions of the model with different split ratios (i.e., ratios of training/validation data to testing data of 90/10, 80/20, or 70/30); and 10 replicates of the model predictions with data split ratios of 90/10 (PDF)
Micropollutant removal efficiencies of commercial and self-fabricated membranes (XLSX)
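
The abstract attributes falsely high accuracy to data leakage from inappropriate data splitting. As a rough illustration only, and not the authors' procedure, the sketch below assumes that leakage arises when records sharing the same membrane and compound land in both the training and testing sets, and it splits by group so that cannot happen. The function `group_split`, the record layout, and all membrane names and removal values are hypothetical.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test.

    Hypothetical helper for illustration; `group_key` maps a record to
    the identifier that must not be shared across the two sets.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append(rec)

    keys = sorted(groups)            # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    # Hold out whole groups, at least one, rather than individual rows.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Made-up records: (membrane, compound, removal efficiency in %).
records = [
    ("NF-A", "caffeine", 85.0),
    ("NF-A", "caffeine", 87.0),
    ("NF-A", "ibuprofen", 92.0),
    ("RO-B", "caffeine", 98.0),
    ("RO-B", "caffeine", 97.5),
    ("RO-B", "ibuprofen", 99.0),
]


def pair(rec):
    """Group identifier: the (membrane, compound) pair (an assumption)."""
    return (rec[0], rec[1])


train, test = group_split(records, pair, test_fraction=0.25, seed=1)

# Groups present on both sides of the split; empty by construction.
shared = {pair(r) for r in train} & {pair(r) for r in test}
print(len(train), len(test), shared)
```

A purely random row-wise split of the same records could place the two near-identical `("NF-A", "caffeine")` measurements on opposite sides, letting a model score well on the test set by memorizing rather than generalizing. Which grouping variable is appropriate depends on the dataset and is not specified here.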
environmental sciences; engineering, environmental