Data imbalance causes underestimation of high ozone pollution in machine learning models: a weighted support vector regression solution

Ling Zhen, Baihua Chen, Lin Wang, Lin Yang, Wei Xu, Ru-Jin Huang
DOI: https://doi.org/10.1016/j.atmosenv.2024.120952
IF: 5
2024-11-30
Atmospheric Environment
Abstract: Machine learning (ML) models have been widely used to predict ground-level ozone (O₃), one of the most concerning air pollutants in China. However, many ML models tend to underestimate high O₃ levels, likely because of class imbalance in the training data. In this study, we combined data from ground monitoring stations (CO, NO₂, SO₂, PM₁₀, PM₂.₅, MDA8 O₃, longitude, and latitude), satellite observations (HCHO column concentration), and meteorological variables (2 m temperature, solar radiation, relative humidity, wind components, surface pressure, precipitation, evaporation, boundary layer height, and cloud cover) to assess the impact of data imbalance on prediction performance. The results demonstrated that ML models that do not account for data imbalance severely underestimate high O₃ levels. We proposed a sample-weighting-based support vector regression model (SVR-W) that explicitly accounts for the data imbalance. Based on data from 2026 monitoring stations across 31 provinces in China, the SVR-W model achieved an average bin-slope between observed and predicted O₃ of 0.74 ± 0.12 (s.d.), significantly better than commonly used ML models (0.64 ± 0.11). The average bin-RMSE of the SVR-W model was 26.7 μg m⁻³, outperforming the other models. The average recall of high O₃ levels was 11-31% higher than that of commonly used ML models. On all three metrics, the SVR-W model improved prediction performance for both regular and extreme O₃ pollution levels. Our findings suggest that addressing data imbalance is crucial when applying ML models to environmental data.
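The abstract does not state how SVR-W assigns its sample weights or exactly how the bin-based metrics are computed, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes inverse-frequency weighting over concentration bins and a bin-RMSE defined as the mean of per-bin RMSE values. The function names and the bin edges (100 and 160 μg m⁻³) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def bin_index(value, edges):
    """Return the index of the bin containing `value`, given ascending bin edges."""
    return sum(1 for edge in edges if value >= edge)


def inverse_frequency_weights(targets, edges):
    """Weight each sample inversely to the size of its target bin (assumed scheme).

    Weights are scaled so that they average to 1 across all samples.
    """
    bins = [bin_index(t, edges) for t in targets]
    counts = Counter(bins)
    n_samples = len(targets)
    n_bins = len(counts)  # only bins that actually contain samples
    return [n_samples / (n_bins * counts[b]) for b in bins]


def mean_bin_rmse(observed, predicted, edges):
    """Average the RMSE computed separately within each observed-value bin (assumed definition)."""
    squared_errors = {}
    for obs, pred in zip(observed, predicted):
        squared_errors.setdefault(bin_index(obs, edges), []).append((obs - pred) ** 2)
    rmses = [math.sqrt(sum(errs) / len(errs)) for errs in squared_errors.values()]
    return sum(rmses) / len(rmses)


# Illustrative, made-up MDA8 O3 values: six low observations and two high ones.
observed = [40, 50, 60, 70, 80, 90, 170, 200]
edges = [100, 160]  # hypothetical bin edges
weights = inverse_frequency_weights(observed, edges)

# A constant predictor looks better on overall RMSE than on bin-averaged RMSE,
# because the rare high values count equally once errors are averaged per bin.
predicted = [80] * len(observed)
overall_rmse = math.sqrt(
    sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
)
binned_rmse = mean_bin_rmse(observed, predicted, edges)
```

Weights of this kind can be passed to a support vector regressor that accepts per-sample weights; for example, scikit-learn's `SVR.fit` takes a `sample_weight` argument. Whether the paper uses this library or this weighting formula is not stated in the abstract.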
environmental sciences, meteorology & atmospheric sciences