Super-learning and Ensemble Weighted Averaging Models to Predict Hyperlocal Long-Term Exposure to Fine Particulate Matter Components in the United States
Heresh Amini, Mahdieh Danesh Yazdi, Qian Di, Weeberb Requia, Yara Abu Awad, Liuhua Shi, Meredith Franklin, Choong Min Kang, Jack Mikhail Wolfson, Peter James, Rima Habre, Seyed Mahmood Taghavi Shahri, Zorana Jovanovic Andersen, Itai Kloog, Petros Koutrakis, Joel Schwartz
DOI: https://doi.org/10.1289/isee.2021.p-231
2021-01-01
ISEE Conference Abstracts
Abstract: BACKGROUND AND AIM: Fine particulate matter (PM2.5) mass is classified as carcinogenic to humans and is linked to mortality and morbidity; however, less is known about the health risks of individual PM2.5 components. We aimed to predict PM2.5 components across the contiguous United States. METHODS: Daily mean PM2.5 component data (EC, OC, NO3, NH4, SO4, Br, Ca, Cu, Fe, Ni, K, Pb, Si, V, and Zn) were obtained from the EPA and several other sources. Annual means were calculated at 987 monitoring sites from 2000 to 2019. About 160 predictor variables were used for modeling, such as traffic counts, distance to OpenStreetMap features, and satellite observations available through Google Earth Engine. After partitioning the data into 70% training and 30% testing sets, two separate modeling approaches were developed on Microsoft Azure, one for non-urban areas and one for urban areas. In non-urban areas, six machine-learning (ML) algorithms were fit on the training set to predict at 1 km spatial resolution, and their predictions were integrated using six super-learners (SLs) and a GAM-based ensemble weighted average (ENWA). In 3,535 urban areas, models were trained on a 50 m spatial grid, and predictions from three ML algorithms were integrated using four SLs and an ENWA. The trained models were assessed using 10-fold cross-validation and externally validated on the test set. RESULTS: The SL based on support vector machines with a polynomial kernel outperformed the other models for most PM2.5 components. The minimum and maximum R² on the unseen test sets for non-urban areas were 0.826 (Br) and 0.975 (SO4), respectively. In urban areas, these were 0.821 (Br) and 0.973 (SO4). The median R² on the test sets across all models and components was 0.91. CONCLUSIONS: Our high-resolution, hyperlocal predictions across 20 years will enable new epidemiological studies of the health risks of PM2.5 components that were not previously possible in the contiguous United States.
KEYWORDS: PM2.5 components, machine-learning, super-learning, ensemble, United States
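
The workflow summarized in the abstract (70/30 split, base learners, 10-fold cross-validation, a super-learner that combines base predictions, and R² on the held-out test set) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not the authors' implementation: the data are synthetic with a single predictor, the two base learners (a linear fit and a 1-nearest-neighbour regressor) stand in for the six ML algorithms, and the meta-learner is a plain least-squares weighting rather than the SVM or GAM-based combiners named in the abstract.

```python
import math
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_linear(xs, ys):
    """Toy base learner 1: ordinary least squares on one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """Toy base learner 2: 1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def out_of_fold(xs, ys, learners, k=10):
    """Predict each training point with models that never saw it."""
    n = len(xs)
    oof = [[0.0] * len(learners) for _ in range(n)]
    for start in range(k):
        fold = set(range(start, n, k))          # interleaved fold
        tr_x = [xs[i] for i in range(n) if i not in fold]
        tr_y = [ys[i] for i in range(n) if i not in fold]
        for j, fit in enumerate(learners):
            model = fit(tr_x, tr_y)
            for i in fold:
                oof[i][j] = model(xs[i])
    return oof

def fit_meta_weights(oof, ys):
    """Meta-learner: least-squares weights for two base learners (2x2 normal equations)."""
    a11 = sum(p[0] * p[0] for p in oof)
    a12 = sum(p[0] * p[1] for p in oof)
    a22 = sum(p[1] * p[1] for p in oof)
    b1 = sum(p[0] * y for p, y in zip(oof, ys))
    b2 = sum(p[1] * y for p, y in zip(oof, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

# Synthetic data (an assumption): one predictor, a nonlinear signal plus noise.
rng = random.Random(0)
x_all = [rng.uniform(0.0, 10.0) for _ in range(200)]
y_all = [2.0 + 0.5 * x + math.sin(x) + rng.gauss(0.0, 0.2) for x in x_all]

# 70% training / 30% testing partition.
n_train = int(0.7 * len(x_all))
x_train, y_train = x_all[:n_train], y_all[:n_train]
x_test, y_test = x_all[n_train:], y_all[n_train:]

learners = [fit_linear, fit_nearest]
oof = out_of_fold(x_train, y_train, learners, k=10)
weights = fit_meta_weights(oof, y_train)

# Refit base learners on the full training set, then combine on the test set.
models = [fit(x_train, y_train) for fit in learners]
stacked = [sum(w * m(x) for w, m in zip(weights, models)) for x in x_test]
test_r2 = r_squared(y_test, stacked)
print(f"meta weights: {weights}, test R2: {test_r2:.3f}")
```

The meta-learner is fit only on out-of-fold predictions, so the combination weights are not rewarded for base learners that merely memorize the training data; the held-out 30% is touched only once, for the final R².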