Analysis of the Influences of Sampling Bias and Class Imbalance on Performances of Probabilistic Liquefaction Models
Ji-Lei Hu, Xiao-Wei Tang, Jiang-Nan Qiu
DOI: https://doi.org/10.1061/(asce)gm.1943-5622.0000808
IF: 3.918
2016-01-01
International Journal of Geomechanics
Abstract: Sampling bias and class imbalance are important parts of model uncertainty that have a significant impact on the predictive probability of classification models. This study analyzed the influences of sampling bias and class imbalance on the performance of four common methods used in 10 models for seismic liquefaction, namely Bayesian network (BN), artificial neural network (ANN), logistic regression (LR), and support vector machine (SVM), using controlled experiments based on standard penetration test (SPT) data from 350 case histories. The data are divided into two data sets with class distributions of 150:150 and 200:100, each of which is stratified and sampled to obtain 11 class distributions (10:90, 20:80, 25:75, 33:67, 40:60, 50:50, 60:40, 67:33, 75:25, 80:20, and 90:10) to quantify the predictive performance of the four models using statistical model validation metrics, such as overall accuracy, area under the receiver operating characteristic curve, precision, recall, and F-score. The experiments show that the best distribution of liquefaction samples for training is not a fixed point but, rather, a range. The authors suggest that the best range of sample distribution (liquefaction/nonliquefaction ratio) is from 1 to 1.5 for the BN method, from 0.67 to 1 for the ANN method, approximately 0.5 for the LR model, and from 0.5 to 1 for the SVM method. Furthermore, oversampling was used to try to improve the predictive capability of the four models for two samples (10:90 and 90:10) with severe class imbalance and sampling bias. Relative to the original samples, oversampling considerably improved predictive performance for the LR model and the SVM polynomial (SVM-Pol) model rather than for the BN maximum likelihood estimation (BN-MLE) model and the ANN radial basis function (ANN-RBF) model.
In addition, in fields where the real class distribution of the population is unknown, when a training sample contains severe class imbalance or sampling bias, the authors recommend that researchers choose an oversampled sample that has the same class distribution as the population of the collected data to ensure optimal performance.
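
The resampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`stratified_sample`, `oversample_to_ratio`, `precision_recall_f`) are hypothetical, and the abstract does not say which oversampling technique was used, so plain random oversampling with replacement is assumed here. Cases are represented as opaque list items, with label 1 standing for liquefaction.

```python
import random


def stratified_sample(liq, nonliq, liq_pct, total, seed=0):
    """Draw `total` cases without replacement so that liq_pct percent are liquefaction cases."""
    rng = random.Random(seed)
    n_liq = round(total * liq_pct / 100)
    n_non = total - n_liq
    if n_liq > len(liq) or n_non > len(nonliq):
        raise ValueError("not enough cases in one class for this distribution")
    return rng.sample(liq, n_liq), rng.sample(nonliq, n_non)


def oversample_to_ratio(liq, nonliq, target_liq_pct, seed=0):
    """Randomly duplicate cases (assumed technique) until liquefaction cases make up
    roughly target_liq_pct percent of the sample. Only the under-represented class grows."""
    rng = random.Random(seed)
    liq, nonliq = list(liq), list(nonliq)
    current_pct = 100 * len(liq) / (len(liq) + len(nonliq))
    if current_pct < target_liq_pct:
        # Grow the liquefaction class: n_liq / (n_liq + n_non) = p
        wanted = round(len(nonliq) * target_liq_pct / (100 - target_liq_pct))
        liq += rng.choices(liq, k=wanted - len(liq))
    elif current_pct > target_liq_pct:
        # Grow the nonliquefaction class instead
        wanted = round(len(liq) * (100 - target_liq_pct) / target_liq_pct)
        nonliq += rng.choices(nonliq, k=wanted - len(nonliq))
    return liq, nonliq


def precision_recall_f(y_true, y_pred):
    """Precision, recall, and F-score for the positive (liquefaction = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

For example, under these assumptions a 10:90 training sample of 100 cases could be drawn from a 150:150 pool with `stratified_sample(liq, nonliq, 10, 100)` and then rebalanced toward 50:50 with `oversample_to_ratio(sub_liq, sub_non, 50)` before fitting a classifier.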