A Data-Driven Supervised Machine Learning Approach to Estimating Global Ambient Air Pollution Concentrations With Associated Prediction Intervals

Liam J Berrisford, Hugo Barbosa, Ronaldo Menezes
2024-02-15
Abstract: Global ambient air pollution, a transboundary challenge, is typically addressed through interventions relying on data from spatially sparse and heterogeneously placed monitoring stations. These stations often encounter temporal data gaps due to issues such as power outages. In response, we have developed a scalable, data-driven, supervised machine learning framework. This model is designed to impute missing temporal and spatial measurements, thereby generating a comprehensive dataset for pollutants including NO$_2$, O$_3$, PM$_{10}$, PM$_{2.5}$, and SO$_2$. The dataset, with a fine granularity of 0.25$^{\circ}$ at hourly intervals and accompanied by prediction intervals for each estimate, caters to a wide range of stakeholders relying on outdoor air pollution data for downstream assessments. This enables more detailed studies. Additionally, the model's performance across various geographical locations is examined, providing insights and recommendations for strategic placement of future monitoring stations to further enhance the model's accuracy.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem of estimating global ambient air pollution concentrations with data-driven supervised machine learning, and of attaching a prediction interval to each estimate. Air pollution crosses borders, yet the monitoring stations that measure it are spatially sparse, unevenly distributed, and prone to temporal gaps in their records. The paper introduces a scalable machine learning framework that fills these gaps in both time and space, estimating concentrations of NO2, O3, PM10, PM2.5, and SO2. The resulting dataset has a spatial resolution of 0.25° at hourly intervals, carries a prediction interval for every estimate, and is intended for downstream assessments that rely on outdoor air pollution data. The paper also analyzes how model performance varies across geographical locations and uses that analysis to recommend where future monitoring stations should be placed to improve accuracy. Finally, it explores whether air pollution levels in one country can be inferred from another country's data, and provides a global evaluation of air quality.
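To make the idea of a point estimate "accompanied by a prediction interval" concrete, the following is a minimal sketch, not the authors' method. It assumes a toy predictor (the mean concentration per hour of day), synthetic data, and split conformal prediction as the interval technique; all function names, the 90% target coverage, and the data-generating process are hypothetical choices made only for illustration.

```python
import math
import random


def fit_hourly_mean(records):
    """Toy point predictor (an assumption): mean concentration per hour of day."""
    sums, counts = {}, {}
    for hour, value in records:
        sums[hour] = sums.get(hour, 0.0) + value
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


def conformal_half_width(model, calibration, alpha=0.1):
    """Split conformal: take the ceil((n + 1) * (1 - alpha))-th smallest
    absolute residual on a calibration set the model was not fitted on."""
    residuals = sorted(abs(value - model[hour]) for hour, value in calibration)
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        raise ValueError("calibration set too small for this alpha")
    return residuals[rank - 1]


def predict_interval(model, hour, half_width):
    """Return (lower, point, upper); the lower bound is clipped at zero
    because a concentration cannot be negative."""
    point = model[hour]
    return max(0.0, point - half_width), point, point + half_width


def synthetic_series(n, rng):
    """Hypothetical hourly readings: a daily cycle plus Gaussian noise."""
    series = []
    for i in range(n):
        hour = i % 24
        value = 20.0 + 10.0 * math.sin(2 * math.pi * hour / 24) + rng.gauss(0.0, 3.0)
        series.append((hour, max(0.0, value)))
    return series


rng = random.Random(0)
train = synthetic_series(480, rng)
calibration = synthetic_series(480, rng)
held_out = synthetic_series(480, rng)

model = fit_hourly_mean(train)
half_width = conformal_half_width(model, calibration, alpha=0.1)

# Empirical coverage: fraction of held-out readings inside their interval.
covered = 0
for hour, value in held_out:
    lower, _, upper = predict_interval(model, hour, half_width)
    covered += lower <= value <= upper
coverage = covered / len(held_out)

print(f"half-width: {half_width:.2f}, held-out coverage: {coverage:.2%}")
```

With `alpha=0.1` the held-out coverage should land near 90%. Checking that empirical coverage against the nominal level is one way a downstream user could decide how much weight to give an interval at a given location.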