Identifying Effect Modification of Latent Population Characteristics on Risk Factors with a Sparse Varying Coefficient Regression

Ruofan Wang, Lei Fang, Yue Wang, Jin Jin
DOI: https://doi.org/10.1101/2024.11.30.626101
2024-12-05
Abstract: Leveraging observational data to understand the associations between risk factors and disease outcomes and conduct disease risk prediction is a common task in epidemiology. While traditional linear regression and other machine learning models have been extensively implemented for this task, the associations between risk factors and disease outcomes are typically deemed fixed. In many cases, however, such associations may vary by some underlying features of the individuals, which may involve certain subpopulation characteristics and environmental factors. While data for these latent features may not be available, the observed data on risk factors may have captured some proportion of the variation in these features. Thus extracting latent factors from risk factors and incorporating this effect modification into the model may better capture the underlying data structure and improve inference. We develop a novel regression model with some coefficients varying as functions of latent features extracted from the risk factors. We have demonstrated the superiority of our approach in various data settings via simulation studies. An application on a dataset for lung cancer patients from The Cancer Genome Atlas (TCGA) Program showed that our approach led to a 6% - 118% increase in (AUC-0.5) for distinguishing between different lung cancer stages compared to the classic lasso and elastic net regressions and identified interesting latent effect modifications associated with certain gene pathways.
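The abstract does not say how the latent features are extracted or what functional form the varying coefficients take. As a rough sketch only, assume the latent features are the top principal component scores of the risk factors and that each coefficient is linear in them, so that beta_j(z) = b_j0 + sum_l b_jl * z_l. Under those assumptions the model reduces to an ordinary regression on an expanded design matrix, and a sparse penalty such as the lasso on the interaction columns would select which coefficients actually vary. The function names `latent_factors` and `varying_coefficient_design` are hypothetical and do not come from the paper.

```python
import numpy as np

def latent_factors(X, k):
    """Top-k principal component scores of X.

    Assumption: PCA stands in for the paper's latent feature extraction.
    """
    Xc = X - X.mean(axis=0)                      # center each risk factor
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # shape (n, k)

def varying_coefficient_design(X, k):
    """Build [X, X*z_1, ..., X*z_k].

    Columns of X carry the fixed parts b_j0; columns of X*z_l carry the
    effect-modification parts b_jl (assumed linear in the latent factors).
    """
    Z = latent_factors(X, k)
    interactions = [X * Z[:, [l]] for l in range(k)]  # broadcast z_l over columns
    return np.hstack([X] + interactions)

# Simulated risk factors, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
D = varying_coefficient_design(X, k=2)
print(D.shape)  # (200, 15): 5 main-effect columns plus 2 * 5 interaction columns
```

Fitting any lasso solver to `D` would then shrink unneeded interaction coefficients to zero, leaving a fixed coefficient for risk factors whose effects do not appear to vary with the latent factors.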
Genetics
What problem does this paper attempt to address?
The paper addresses how to use observational data in epidemiological studies to understand the association between risk factors and disease outcomes and to predict disease risk. Traditional linear regression and other machine learning models are widely used for this task, but they usually assume that the association between each risk factor and the outcome is fixed. In many cases the association may instead vary with underlying characteristics of the individuals, such as subpopulation characteristics and environmental factors. Data on these characteristics are often unavailable, yet the observed risk factors may capture part of their variation. Extracting latent factors from the risk factors and building this effect modification into the model may therefore better capture the underlying data structure and improve inference. Specifically, the paper proposes a regression model in which some coefficients vary as functions of latent features extracted from the risk factors. The authors report that the method outperformed alternatives across various simulation settings. In an application to a TCGA dataset of lung cancer patients, it achieved a 6% to 118% increase in (AUC-0.5), that is, in AUC above the chance level of 0.5 rather than in raw AUC, for distinguishing between lung cancer stages compared with classic lasso and elastic net regression, and it identified latent effect modifications associated with certain gene pathways.
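The (AUC-0.5) measure quoted in the abstract expresses improvement relative to the chance level, which makes percentage gains look much larger than a relative change in raw AUC. The arithmetic can be sketched as follows. The AUC values 0.60 and 0.66 are made up for illustration and are not results from the paper, and the function name `relative_gain_over_chance` is hypothetical.

```python
def relative_gain_over_chance(auc_new, auc_base):
    """Relative increase in (AUC - 0.5), the margin above random guessing."""
    return (auc_new - 0.5) / (auc_base - 0.5) - 1.0

# Hypothetical AUCs, not taken from the paper.
auc_base, auc_new = 0.60, 0.66

gain_over_chance = relative_gain_over_chance(auc_new, auc_base)
gain_raw = auc_new / auc_base - 1.0

print(round(gain_over_chance, 2))  # 0.6, i.e. a 60% increase in (AUC - 0.5)
print(round(gain_raw, 2))          # 0.1, i.e. a 10% increase in raw AUC
```

The same pair of AUCs gives a 60% gain on one scale and a 10% gain on the other, so the reported 6% to 118% range should be read on the (AUC-0.5) scale.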