Through the lens of causal inference: Decisions and pitfalls of covariate selection

Gang Chen, Zhengchen Cai, Paul A Taylor
DOI: https://doi.org/10.1101/2024.01.11.575211
2024-05-17
Abstract: The critical importance of justifying the inclusion of covariates is a facet often overlooked in data analysis. While the incorporation of covariates typically follows informal guidelines, we argue for a comprehensive exploration of underlying principles to avoid significant statistical and interpretational challenges. Our focus is on addressing three common yet problematic practices: the indiscriminate lumping of covariates, the lack of rationale for covariate inclusion, and the oversight of potential issues in result reporting. These challenges, prevalent in neuroimaging models involving covariates such as reaction time, demographics, and morphometric measures, can introduce biases, including overestimation, underestimation, masking, sign flipping, or spurious effects. Our exploration of causal inference principles underscores the pivotal role of domain knowledge in guiding covariate selection, challenging the common reliance on statistical measures. This understanding carries implications for experimental design, model-building, and result interpretation. We draw connections between these insights and reproducibility concerns, specifically addressing the selection bias resulting from the widespread practice of strict thresholding, akin to the logical pitfall associated with "double dipping." Recommendations for robust data analysis involving covariates encompass explicit research question statements, justified covariate inclusions/exclusions, centering quantitative variables for interpretability, appropriate reporting of effect estimates, and advocating a "highlight, don't hide" approach in result reporting. These suggestions are intended to enhance the robustness, transparency, and reproducibility of covariate-driven analyses, encompassing investigations involving consortium datasets such as ABCD and UK Biobank.
We discuss how researchers can use a transparent depiction of the covariate relationships to enhance the ethos of open science and promote research reproducibility.
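As a small illustration of the recommendation to center quantitative variables for interpretability, the following is a minimal sketch and not taken from the paper. The reaction-time values, the noise-free outcome, and the helper name `fit_line` are all assumptions chosen for clarity. It shows that centering a covariate leaves the slope unchanged while turning the intercept into the predicted outcome at the average covariate value, rather than at a covariate value of zero that may lie far outside the data.

```python
import numpy as np

# Hypothetical covariate: reaction time in ms (values invented for illustration).
rt = np.array([420.0, 480.0, 510.0, 560.0, 630.0])
# Hypothetical noise-free outcome with intercept 2.0 and slope 0.01.
y = 2.0 + 0.01 * rt

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

# Uncentered: the intercept is the predicted outcome at rt = 0 ms,
# a value no participant could plausibly have.
raw_intercept, raw_slope = fit_line(rt, y)

# Centered: the intercept is the predicted outcome at the mean reaction time.
ctr_intercept, ctr_slope = fit_line(rt - rt.mean(), y)

print("uncentered intercept, slope:", raw_intercept, raw_slope)
print("centered intercept, slope:  ", ctr_intercept, ctr_slope)
```

In this sketch the slope is the same in both fits; only the meaning of the intercept changes, which is why centering helps interpretation without altering the estimated covariate effect.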
Biology