Considerations for using tree-based machine learning to assess causation between demographic and environmental risk factors and health outcomes
Daniela Galatro, Alessia Di Nardo, Varun Pai, Rosario Trigo-Ferre, Melanie Jeffrey, Maria Jacome, Vincenzo Costanzo-Alvarez, Jason Bazylak, Cristina H. Amon
DOI: https://doi.org/10.1007/s11356-024-35304-4
IF: 5.8
2024-10-13
Environmental Science and Pollution Research
Abstract: Evaluation of the heterogeneous treatment effect (HTE) allows the causal effect of a therapy or intervention to be assessed while accounting for heterogeneity in individual factors within a population. Machine learning (ML) methods have previously been employed for HTE evaluation, addressing the limitations associated with modelling complex systems. In this work, three tree-based ML algorithms, causal random forest (CRF), causal Bayesian additive regression trees (CBART), and causal rule ensemble (CRE), are used to analyze whether benzene exposure potentially causes childhood acute myeloid leukemia (AML). Data for this analysis are generated by drawing samples from a previously developed model that estimates AML probability from demographic information and benzene exposure. The three algorithms are compared in terms of the predicted average treatment effect (ATE), the regression coefficient of determination, and computational time. Minimal difference is reported between the three algorithms in both the ATE and the regression coefficient of determination. However, CRF outperforms CBART in computational time. Moreover, CRF allows for both continuous and binary treatment variables, unlike CBART and CRE, making it better suited to environmental health studies, where pollutant exposure levels should be treated as continuous. Following the comparison of all three algorithms, the influence of adding Gaussian noise to the treatment and outcome variables, as well as outliers, is investigated using CRF. A set of considerations is drawn up to guide researchers in using these algorithms. These considerations cover the simulation settings, applications, and interpretation of results, and aim to provide prompt information for decision-making on the establishment of pollutant exposure thresholds in environmental risk assessments.
environmental sciences
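To make the quantities named in the abstract concrete, the following is a minimal sketch of what an ATE and a heterogeneous effect look like on simulated data, and of how Gaussian noise on the outcome can be added to check robustness. It is not the authors' model and does not implement CRF, CBART, or CRE: the outcome model, its coefficients, the `age` covariate, and all function names are hypothetical, and the estimator is a plain difference in means, which is valid here only because the simulated exposure is assigned at random.

```python
import random
import statistics


def simulate(n, noise_sd=0.0, seed=0):
    """Draw (age, treated, outcome) rows from a hypothetical outcome model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(0.0, 15.0)       # hypothetical covariate, in years
        treated = rng.random() < 0.5       # randomized binary exposure
        effect = 0.02 + 0.004 * age        # assumed effect, varies with age (HTE)
        outcome = 0.10 + 0.01 * age + (effect if treated else 0.0)
        outcome += rng.gauss(0.0, noise_sd)  # optional Gaussian outcome noise
        rows.append((age, treated, outcome))
    return rows


def difference_in_means(rows):
    """Mean outcome of exposed rows minus mean outcome of unexposed rows."""
    exposed = [y for _, t, y in rows if t]
    unexposed = [y for _, t, y in rows if not t]
    return statistics.fmean(exposed) - statistics.fmean(unexposed)


def subgroup_effects(rows, age_cut=7.5):
    """Estimate the effect separately below and above an age cut-off."""
    younger = [r for r in rows if r[0] < age_cut]
    older = [r for r in rows if r[0] >= age_cut]
    return difference_in_means(younger), difference_in_means(older)


clean_rows = simulate(20000)
noisy_rows = simulate(20000, noise_sd=0.05)

clean_ate = difference_in_means(clean_rows)
noisy_ate = difference_in_means(noisy_rows)
younger_effect, older_effect = subgroup_effects(clean_rows)

print(f"ATE without noise: {clean_ate:.4f}")
print(f"ATE with outcome noise: {noisy_ate:.4f}")
print(f"Effect, age < 7.5: {younger_effect:.4f}")
print(f"Effect, age >= 7.5: {older_effect:.4f}")
```

Under these assumptions the true ATE is 0.02 + 0.004 × 7.5 = 0.05, so both estimates should land close to that value, while the two subgroup estimates should differ, which is the heterogeneity that tree-based methods are designed to discover without a pre-specified cut-off.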