Long-term hourly air quality data bridging of neighboring sites using automated machine learning: A case study in the Greater Bay area of China
Boxi Wu, Cheng Wu, Yuchen Ye, Chenglei Pei, Tao Deng, Yong Jie Li, Xingcheng Lu, Lei Wang, Bin Hu, Mei Li, Dui Wu
DOI: https://doi.org/10.1016/j.atmosenv.2024.120347
IF: 5
2024-03-01
Atmospheric Environment
Abstract: Long-term air pollution data are essential for formulating air quality management policies and assessing their impacts on public health. However, missing data are inevitable in air pollution observations at individual sites. This study proposed a machine learning approach that uses data from neighboring sites to reconstruct missing data. Hourly observation data from three neighboring sites in the Pearl River Delta (PRD) region in South China were used for data retrieval: the NC site (2006–2015), the JXL site, and the PYZX site (2014–2022). The overlapping data (May 2014–December 2015) were used to train and evaluate the machine learning models. The performance of 11 algorithms (CatBoost, XGBoost, LightGBM, LightGBMXT, LightGBMLarge, RandomForestMSE, ExtraTreeMSE, NeuralNetTorch, NeuralNetFastAI, KNeighborsDist, and KNeighborsUnif) in retrieving the major air pollutants O3, NO2, PM2.5, PM10, and SO2 was benchmarked using a set of evaluation metrics. CatBoost showed the best performance and was therefore adopted for air pollutant data reconstruction at NC (2016–2022) and PYZX (2008–2014). Long-term data (2006–2022) at the NC site were obtained by combining the observed and retrieved data. In the past 15 years, the O3 concentration at the NC site has increased by 72% at a rate of 0.83 ppb yr−1 (3.2% yr−1). In contrast, substantial reductions were observed for NO2 (61%), PM2.5 (51%), and PM10 (42%) at the NC site, at rates of −1.27 ppb yr−1 (−5.9% yr−1), −1.96 μg m−3 yr−1 (−5.8% yr−1), and −2.32 μg m−3 yr−1 (−5.2% yr−1), respectively. SO2 exhibited the most pronounced reduction (79%) among all species, with two distinct rates of −4.10 ppb yr−1 (−27.4% yr−1) and −0.40 ppb yr−1 (−6.2% yr−1) for 2008–2012 and 2012–2022, respectively.
This study demonstrates the feasibility of using machine learning to fill data gaps in air pollution monitoring networks and highlights the importance of continuous long-term air pollution data in reviewing air quality management policies.
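The bridging workflow described in the abstract (train on the period when the target site and its neighbors overlap, evaluate on held-out overlap hours, then predict the target site for hours when only the neighbors report) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract reports CatBoost as the adopted model, whereas the sketch below swaps in a small pure-Python inverse-distance-weighted k-nearest-neighbor regressor so that it runs without third-party libraries. All data are synthetic, and every name (`make_synthetic_hours`, `knn_dist_predict`, `site_a`, `site_b`, the 480/120 hour split) is a hypothetical choice made for illustration.

```python
import math
import random


def make_synthetic_hours(n_hours, seed=0):
    """Build synthetic hourly rows: ((site_a, site_b), target). Not real data."""
    rng = random.Random(seed)
    rows = []
    for hour in range(n_hours):
        diurnal = 10.0 * math.sin(2 * math.pi * (hour % 24) / 24)
        site_a = 30.0 + diurnal + rng.gauss(0, 3)
        site_b = 28.0 + 0.8 * diurnal + rng.gauss(0, 3)
        # Assumed relationship between the target site and its neighbors.
        target = 0.6 * site_a + 0.4 * site_b + rng.gauss(0, 1)
        rows.append(((site_a, site_b), target))
    return rows


def knn_dist_predict(train_X, train_y, x, k=5):
    """Inverse-distance-weighted k-nearest-neighbor regression for one point."""
    nearest = sorted(
        (math.dist(row, x), y) for row, y in zip(train_X, train_y)
    )[:k]
    # If the query coincides with training points, average their targets.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def r_squared(observed, predicted):
    """Coefficient of determination between observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


rows = make_synthetic_hours(600)
# Pretend the target site was observed only during the first 480 hours.
overlap, gap = rows[:480], rows[480:]

# Step 1: train on part of the overlap, evaluate on the held-out remainder.
train, held_out = overlap[:384], overlap[384:]
train_X = [x for x, _ in train]
train_y = [y for _, y in train]
held_out_obs = [y for _, y in held_out]
held_out_pred = [knn_dist_predict(train_X, train_y, x) for x, _ in held_out]
r2 = r_squared(held_out_obs, held_out_pred)
error = rmse(held_out_obs, held_out_pred)
print(f"held-out R^2 = {r2:.3f}, RMSE = {error:.3f}")

# Step 2: refit on the full overlap and reconstruct the gap from neighbors.
overlap_X = [x for x, _ in overlap]
overlap_y = [y for _, y in overlap]
bridged = [knn_dist_predict(overlap_X, overlap_y, x) for x, _ in gap]
print(f"reconstructed {len(bridged)} missing hours")
```

In a real application the feature vector would likely include more than the neighbors' concentrations of a single pollutant, and the held-out split would need to respect temporal autocorrelation; both are outside what the abstract specifies.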
environmental sciences, meteorology & atmospheric sciences