Evaluating the Validity of Class Balancing Algorithms-Based Machine Learning Models for Geogenic Contaminated Groundwaters Prediction

Hailong Cao, Xianjun Xie, Jianbo Shi, Yanxin Wang
DOI: https://doi.org/10.1016/j.jhydrol.2022.127933
IF: 6.4
2022-01-01
Journal of Hydrology
Abstract: Data-driven machine learning models have been used to predict hazardous substance levels in groundwater. However, class-imbalanced data can yield models with grossly low sensitivity even when their overall accuracy is high. To address this issue, four class balancing algorithms - weighted cross-entropy loss, random oversampling, random undersampling, and adaptive synthetic sampling (ADASYN) - were tested for their validity in improving model sensitivity. Tests of these four algorithms on geogenic high-arsenic groundwater data from the Datong Basin, the Red River Delta of Vietnam, Bangladesh, Texas, and California showed that all four produced more accurate predictions, with an average increase in sensitivity of 53.8% compared to the raw models. ADASYN was the best of the four algorithms, increasing model G-mean (the geometric mean of sensitivity and specificity) by >40% on average. The ADASYN-optimized artificial neural network (ANN) models predicted a higher groundwater arsenic (As) exposure risk in Ghana than in Ethiopia.
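The metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the labels are synthetic, the function names (`g_mean`, `random_oversample`, and so on) are hypothetical, and only random oversampling is shown, since it is the simplest of the four algorithms to write without external libraries. The sketch shows how a model that always predicts the majority class reaches high accuracy while its sensitivity and G-mean are zero.

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def sensitivity(y_true, y_pred):
    """True positive rate: share of positive samples that are detected."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(y_true, y_pred):
    """True negative rate: share of negative samples that are detected."""
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity(y_true, y_pred) * specificity(y_true, y_pred))


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to resample from; return unchanged copies.
        return list(X), list(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


# Synthetic, assumed data: 5 contaminated wells (1) among 100 samples.
y_true = [1] * 5 + [0] * 95
y_majority = [0] * 100  # a model that always predicts "not contaminated"

print("accuracy:", accuracy(y_true, y_majority))
print("sensitivity:", sensitivity(y_true, y_majority))
print("G-mean:", g_mean(y_true, y_majority))

# Placeholder single-feature rows, one per sample.
X = [[float(i)] for i in range(100)]
X_bal, y_bal = random_oversample(X, y_true)
print("class counts after oversampling:", y_bal.count(1), y_bal.count(0))
```

Because the G-mean is a product of the two rates, it collapses to zero whenever either class is missed entirely, which is why it is a more informative target than accuracy on imbalanced data. Resampling of this kind would normally be applied to the training split only, so that the evaluation data keep their original class ratio.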