Assessing the Effect of Data Integration on Predictive Ability of Cancer Survival Models

Yi Guo, Jiang Bian, Francois Modave, Qian Li, Thomas J. George, Mattia Prosperi, Elizabeth Shenkman
DOI: https://doi.org/10.1177/1460458218824692
2019-01-01
Health Informatics Journal
Abstract: Cancer is the second leading cause of death in the United States. To improve cancer prognosis and survival rates, a better understanding of the multi-level contributory factors associated with cancer survival is needed. However, prior research on cancer survival has focused primarily on individual-level factors because integrated datasets have had limited availability. In this study, we sought to examine how data integration impacts the performance of cancer survival prediction models. We linked data from four different sources and evaluated the performance of Cox proportional hazards models for breast, lung, and colorectal cancers under three common data integration scenarios. We showed that adding contextual-level predictors to survival models by linking multiple datasets improved model fit and performance. We also showed that different representations of the same variable or concept have differential impacts on model performance. When building statistical models for cancer outcomes, it is important to consider cross-level predictor interactions.