Temporal Heterogeneity in the Performance of Machine Learning Models for PM2.5 Concentration Estimation

Peizheng Li, Shiqi Huang, Chenxi Luo, Xiangying Li, Qingyu Zhang, Jing Wang, Can Yang, Haomin Yang, Jianpeng Liao, Qihao Chen, Lu Ma
DOI: https://doi.org/10.1016/j.psep.2024.06.115
IF: 7.8
2024-01-01
Process Safety and Environmental Protection
Abstract: Machine learning (ML) methods have been applied extensively to simulate air pollutant concentrations and assess individual exposure in epidemiological studies. However, little research has examined the temporal heterogeneity of ML model performance or the impact of dataset size. To explore temporal heterogeneity in model performance when estimating daily concentrations of fine particulate matter (PM2.5) across China in 2021, we compared five decision tree-based ML models (Random Forest (RF), Categorical Boosting (CatBoost), Gradient Boost Regression Tree (GBRT), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM)) at the daily scale within three distinct timeframes. The performance of all models was evaluated using cross-validation. We observed that the performance of the ML models varied over time and was significantly correlated with PM2.5 concentration. Across the 365 days of 2021, the RF model performed best, with an annual mean R² of 0.86, a minimum of 0.84, and a maximum of 0.95. For RF, we fitted a cubic polynomial curve to the relationship between model performance and PM2.5 concentration and, based on this fit, devised a model selection strategy for different time scales. The strategy achieved an accuracy of up to 79.45%, and the selected models had an average R² of 0.85 and a maximum of 0.95. Additionally, we found that increasing the dataset size did not significantly improve model performance; instead, it resulted in considerably longer runtime and increased memory usage. The methodology and findings of this study are of value for developing more efficient and precise modeling approaches for air pollutant concentrations. Furthermore, this research provides a foundation for regional air pollutant governance and future health-related research.
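The cubic fit between daily model performance and PM2.5 concentration mentioned in the abstract can be illustrated with a minimal sketch. It assumes NumPy and uses synthetic data: the arrays `pm25` and `daily_r2`, the coefficients in `true_coeffs`, and the noise level are all hypothetical and are not values from the study. The abstract does not describe the authors' fitting procedure, so ordinary least squares via `numpy.polyfit` is used here only as one plausible choice.

```python
import numpy as np

# Synthetic, illustrative data only: these are NOT values from the study.
rng = np.random.default_rng(0)
pm25 = np.linspace(5.0, 150.0, 365)  # hypothetical daily mean PM2.5 (µg/m³)
true_coeffs = np.array([1.0e-7, -4.0e-5, 5.0e-3, 0.70])  # highest degree first
daily_r2 = np.polyval(true_coeffs, pm25) + rng.normal(0.0, 0.01, pm25.size)

# Least-squares cubic fit: R2(c) ≈ a*c**3 + b*c**2 + d*c + e
coeffs = np.polyfit(pm25, daily_r2, deg=3)
fitted = np.polyval(coeffs, pm25)

# Coefficient of determination of the fitted curve itself
ss_res = np.sum((daily_r2 - fitted) ** 2)
ss_tot = np.sum((daily_r2 - daily_r2.mean()) ** 2)
fit_quality = 1.0 - ss_res / ss_tot


def predict_performance(concentration: float) -> float:
    """Evaluate the fitted cubic at a given PM2.5 concentration."""
    return float(np.polyval(coeffs, concentration))


print("cubic coefficients (highest degree first):", coeffs)
print("goodness of fit:", round(fit_quality, 3))
print("predicted R2 at 60 µg/m³:", round(predict_performance(60.0), 3))
```

A curve of this kind could feed a selection rule, for example by fitting one curve per model and choosing the model with the highest predicted performance at a given concentration. That rule is an assumption made for illustration, not the strategy reported in the paper.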