Abstract: The widespread problem of missing satellite aerosol retrievals (aerosol optical depth, AOD) degrades the prediction of ground-level PM2.5 concentrations and may lead to unavoidable biases. An appropriate imputation method for the missing values has not been well developed to date. This study developed a two-stage approach (an AOD-imputation stage and a PM2.5-prediction stage) to predict short-term PM2.5 exposure in mainland China from 2013 to 2018. At the AOD-imputation stage, geostatistical methods and machine learning (ML) algorithms were examined to interpolate 1 km satellite aerosol retrievals. At the PM2.5-prediction stage, daily PM2.5 levels were predicted at a resolution of 1 km, based on interpolated AOD and meteorological data. The statistical performance of the different interpolation methods was comprehensively compared at each stage. The original coverage of retrieved AOD was 15.46% on average. For the AOD-imputation stage, ML methods produced a higher coverage of AOD (98.64%) than geostatistical methods (21.43–87.31%). Among the ML algorithms, random forest (RF) and extreme gradient boosting (XGBoost, XG) produced better-quality interpolated AOD (CV R² = 0.89 and 0.85, respectively) than the other algorithms (0.49–0.78), but XGBoost required only 15% of the computing time of RF. For the PM2.5-prediction stage, both RF-AOD and XG-AOD yielded higher accuracy in PM2.5 estimation (CV R² = 0.88 for RF-AOD or XG-AOD, compared with 0.85 for the original AOD) and more stable spatial and temporal extrapolation (spatial CV R² = 0.83, 0.82, and 0.65, and temporal CV R² = 0.83, 0.82, and 0.61, for RF-AOD, XG-AOD, and the original AOD, respectively). For the AOD-imputation stage, the gap-filling efficiency depended more on external information, while the gap-filling accuracy relied more on model structure. For the PM2.5-prediction stage, efficient AOD interpolation (that is, the ability to eliminate the missing data) was a precondition for stable spatial and temporal extrapolation, while the quality of the interpolated AOD brought less significant improvement. XG-AOD was found to be a better choice for estimating daily PM2.5 exposure in health assessments.
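
The two-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: a one-variable least-squares line stands in for the RF and XGBoost models, the data are synthetic, and the gap-free `covariate` (imagined here as some external predictor available in every grid cell) and all function names are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def coverage(values):
    """Fraction of grid cells that hold a value (None marks a gap)."""
    return sum(v is not None for v in values) / len(values)


def impute_aod(aod, covariate):
    """Stage 1: fit AOD ~ covariate on retrieved cells, then fill the gaps."""
    observed = [(c, a) for c, a in zip(covariate, aod) if a is not None]
    a0, b0 = fit_line([c for c, _ in observed], [a for _, a in observed])
    # Retrieved AOD values are kept; only missing cells are predicted.
    return [a if a is not None else a0 + b0 * c for a, c in zip(aod, covariate)]


def predict_pm25(aod_filled, pm25_obs):
    """Stage 2: fit PM2.5 ~ AOD at monitored cells, then predict every cell."""
    pairs = [(a, p) for a, p in zip(aod_filled, pm25_obs) if p is not None]
    a1, b1 = fit_line([a for a, _ in pairs], [p for _, p in pairs])
    return [a1 + b1 * a for a in aod_filled]


# Synthetic example (all numbers are made up for illustration).
random.seed(0)
n = 200
covariate = [random.uniform(0.0, 1.0) for _ in range(n)]
true_aod = [0.1 + 0.8 * c + random.gauss(0.0, 0.02) for c in covariate]
# Keep roughly 15% of retrievals, loosely mimicking the low original coverage.
aod = [a if random.random() < 0.15 else None for a in true_aod]
# Ground monitors exist in only every tenth cell.
pm25_obs = [
    20.0 + 60.0 * a + random.gauss(0.0, 2.0) if i % 10 == 0 else None
    for i, a in enumerate(true_aod)
]

aod_filled = impute_aod(aod, covariate)
pm25_pred = predict_pm25(aod_filled, pm25_obs)

print("AOD coverage before imputation:", coverage(aod))
print("AOD coverage after imputation:", coverage(aod_filled))
```

The point of the sketch is the data flow: stage 2 can only predict PM2.5 in cells where AOD exists, so the coverage gained in stage 1 carries directly into the PM2.5 surface.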

Comparison of Imputation Methods for Missing Values in Air Pollution Data: Case Study on Sydney Air Quality Index

Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage

Comparison of Different Missing-Imputation Methods for MAIAC (Multiangle Implementation of Atmospheric Correction) AOD in Estimating Daily PM2.5 Levels

The impact of data imputation on air quality prediction problem

Missing Traffic Data: Comparison of Imputation Methods

CHOOSING APPROPRIATE IMPUTATION METHODS FOR MISSING DATA: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

Performance Comparison of Hot-Deck Imputation, K-Nearest Neighbor Imputation, and Predictive Mean Matching in Missing Value Handling, Case Study: March 2019 SUSENAS Kor Dataset

Comparison of Missing Data Imputation Methods in Time Series Forecasting

Median-KNN Regressor-SMOTE-Tomek Links for Handling Missing and Imbalanced Data in Air Quality Prediction

A Comparison of Three Popular Methods for Handling Missing Data: Complete-Case Analysis, Inverse Probability Weighting, and Multiple Imputation

Multiview data fusion technique for missing value imputation in multisensory air pollution dataset

A transferred spatio-temporal deep model based on multi-LSTM auto-encoder for air pollution time series missing value imputation

Comparison of missing data imputation methods using weather data

Comparison of Performance of Data Imputation Methods for Numeric Dataset

A Spatiotemporal Approach for Traffic Data Imputation with Complicated Missing Patterns

On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets

Comparison of different Methods for Univariate Time Series Imputation in R

Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

An Experimental Survey of Missing Data Imputation Algorithms

Handling missing data in near real-time environmental monitoring: A system and a review of selected methods

Autoregressive-Model-Based Methods for Online Time Series Prediction with Missing Values: an Experimental Evaluation