Miss-gradient boosting regression tree: a novel approach to imputing water treatment data

Wen Zhang, Rui Li, Jiangpeng Zhao, Jiawei Wang, Xiaoyu Meng, Qun Li
DOI: https://doi.org/10.1007/s10489-023-04828-6
IF: 5.3
2023-07-05
Applied Intelligence
Abstract: Complete data on wastewater quality are essential for managing and monitoring wastewater treatment processes. Most management and monitoring methods involve the use of voluminous training data for imputation, but the problem is that the sensors used in wastewater treatment plants (WWTPs) collect only a limited amount of data. The lack of sufficient training data can diminish the accuracy of traditional imputation techniques. To address this problem, this study developed a novel approach called Miss-GBRT (imputing missing values with gradient boosting regression trees), which can impute missing values into wastewater quality data even with minimal training data. The proposed approach consists of a preprocessing stage and an imputation stage. In the preprocessing stage, different copies of masked datasets are produced from raw data according to various levels of missingness, after which pre-imputation is conducted to ensure the integrality of training data. In the imputation stage, Miss-GBRT is used to combine shallow regression trees to regress the residuals of time and impute each missing value into a masked dataset in a stepwise manner. We carried out extensive experiments on the WWTP datasets of the University of California, Irvine and Beijing Drainage Group to compare Miss-GBRT with baseline imputation methods. The results demonstrated that the proposed approach improves the accuracy with which missing wastewater quality data are imputed under limited training data. It can also perform better than other methods on datasets with considerable proportions of missing values.
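The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' Miss-GBRT implementation: the function names, the use of column means for pre-imputation, the depth-1 trees (stumps) standing in for "shallow regression trees", and the hyperparameters are all assumptions made for illustration, and the time-related residual modelling mentioned in the abstract is not reproduced. The sketch only shows the general pattern of masking, pre-imputing, and then filling each column with boosted trees fitted to residuals, one column at a time.

```python
import random

def mask_values(rows, rate, rng):
    """Preprocessing (assumed form): copy rows, hiding about `rate` of the entries as None."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def pre_impute_mean(rows):
    """Pre-imputation (assumed: column means) so every training row is complete."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        seen = [r[j] for r in rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def fit_stump(X, residuals):
    """Return the depth-1 tree (feature, threshold, left mean, right mean) with least squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(x[j] for x in X))[:-1]:
            left = [r for x, r in zip(X, residuals) if x[j] <= t]
            right = [r for x, r in zip(X, residuals) if x[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None  # None when no feature can be split

def fit_boosted_stumps(X, y, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        pred = [p + learning_rate * (lm if x[j] <= t else rm) for x, p in zip(X, pred)]
    return base, stumps, learning_rate

def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, lr = model
    return base + sum(lr * (lm if x[j] <= t else rm) for j, t, lm, rm in stumps)

def impute(rows):
    """Fill missing entries column by column; earlier imputations feed later columns."""
    result = pre_impute_mean(rows)
    for j in range(len(rows[0])):
        missing = [i for i, r in enumerate(rows) if r[j] is None]
        observed = [i for i, r in enumerate(rows) if r[j] is not None]
        if not missing or len(observed) < 2:
            continue  # nothing to impute, or too little data to fit a model
        def features(i):
            return result[i][:j] + result[i][j + 1:]
        model = fit_boosted_stumps([features(i) for i in observed],
                                   [rows[i][j] for i in observed])
        for i in missing:
            result[i][j] = predict(model, features(i))
    return result
```

A call such as `impute(mask_values(raw_rows, 0.2, random.Random(0)))` masks roughly 20% of a complete table and then fills it back in, which allows imputed values to be compared against the hidden originals at different missingness levels.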
computer science, artificial intelligence