Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies

Zhongzhe Ouyang, Lu Wang, Alzheimer's Disease Neuroimaging Initiative
DOI: https://doi.org/10.3390/math12070951
IF: 2.4
2024-03-24
Mathematics
Abstract: When integrating data from multiple sources, a common challenge is block-wise missing data. Most existing methods address this issue only in cross-sectional studies. In this paper, we propose a method for variable selection when combining datasets from multiple sources in longitudinal studies. To account for block-wise missing covariates, we impute the missing values multiple times based on combinations of samples from different missing patterns and predictors from different data sources. We then use these imputed data to construct estimating equations, and aggregate the information across subjects and sources with the generalized method of moments. We employ the smoothly clipped absolute deviation penalty for variable selection and use the extended Bayesian Information Criterion for tuning parameter selection. We establish the asymptotic properties of the proposed estimator, and demonstrate the superior performance of the proposed method through numerical experiments. Furthermore, we apply the proposed method to the Alzheimer's Disease Neuroimaging Initiative study to identify sensitive early-stage biomarkers of Alzheimer's Disease, which is crucial for early disease detection and personalized treatment.
What problem does this paper attempt to address?
The paper addresses block-wise missing data that arise when integrating datasets from multiple longitudinal studies. Most existing methods target block-wise missing data in cross-sectional studies only. In longitudinal studies the missing pattern is more complex, because data for each subject may be partially missing at different time points. The paper proposes a variable selection method based on multiple imputation to handle block-wise missing data from multiple data sources.

### Main Issues:

1. **Handling Block-Wise Missing Data**: How to handle block-wise missing data effectively when integrating multiple longitudinal study datasets, especially when the proportion of missing data is high and the number of covariates is large.
2. **Variable Selection**: How to perform variable selection in the presence of block-wise missing data in order to identify early biomarkers related to Alzheimer's Disease (AD).

### Background:

- **Multi-Source Data**: Multi-source data are receiving increasing attention in modern scientific research, but they often have block-wise missing values.
- **Alzheimer's Disease Neuroimaging Initiative (ADNI)**: The ADNI dataset contains a large amount of block-wise missing data, particularly in data from cognitively normal (NC), mild cognitive impairment (MCI), and Alzheimer's Disease patients.
- **Limitations of Existing Methods**: Traditional approaches such as complete case analysis and maximum likelihood methods are inefficient when the proportion of missing data is high, and they cannot handle multiple missing patterns.

### Solution:

- **Multiple Imputation**: Impute the missing values multiple times, based on combinations of samples from different missing patterns and predictors from different data sources.
- **Estimating Equations**: Construct estimating equations from the imputed data and use the Generalized Method of Moments (GMM) to aggregate information across subjects and sources.
- **Variable Selection**: Apply the Smoothly Clipped Absolute Deviation (SCAD) penalty for variable selection and use the Extended Bayesian Information Criterion (EBIC) to select the tuning parameter.

### Application:

- **Alzheimer's Disease Research**: Apply the proposed method to the ADNI dataset to identify early biomarkers of Alzheimer's Disease, which is significant for early disease detection and personalized treatment.

### Summary:

The paper proposes a new variable selection method based on multiple imputation to address block-wise missing data when integrating multiple longitudinal study datasets. In simulation experiments and in the ADNI application, the paper reports that the method performs well when the proportion of missing data is high and the number of covariates is large.
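To make the notion of a block-wise missing pattern concrete, the sketch below groups subjects by the set of data sources they have observed. It is only an illustration, not the authors' code: the source names (`MRI`, `PET`, `CSF`), the toy values, and the helper `group_by_missing_pattern` are all hypothetical.

```python
from collections import defaultdict


def group_by_missing_pattern(subjects, sources):
    """Map each pattern of observed sources to the subject IDs sharing it.

    A source counts as missing for a subject when its whole block is None
    (or absent), which is what "block-wise missing" refers to.
    """
    groups = defaultdict(list)
    for subject_id, blocks in subjects.items():
        # The pattern is the ordered tuple of sources this subject has.
        pattern = tuple(src for src in sources if blocks.get(src) is not None)
        groups[pattern].append(subject_id)
    return dict(groups)


# Hypothetical toy data: each subject has one block of covariates per source.
sources = ["MRI", "PET", "CSF"]
subjects = {
    "s1": {"MRI": [1.2, 0.7], "PET": [0.3], "CSF": [5.1]},
    "s2": {"MRI": [0.9, 1.1], "PET": None, "CSF": [4.8]},
    "s3": {"MRI": [1.0, 0.8], "PET": None, "CSF": [5.0]},
    "s4": {"MRI": [1.4, 0.6], "PET": [0.5], "CSF": None},
}

patterns = group_by_missing_pattern(subjects, sources)
for pattern, ids in patterns.items():
    print(pattern, ids)
```

In this toy example only `s1` is a complete case, so a complete case analysis would discard three of the four subjects. This is the inefficiency that motivates imputing across patterns instead.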
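The SCAD penalty named in the solution can be written down directly. The following is a minimal sketch of the standard piecewise form from Fan and Li (2001) with the conventional default `a = 3.7`. How the paper combines this penalty with its GMM objective is not shown here, and the function name `scad_penalty` is an assumption for illustration.

```python
def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty for a single coefficient.

    Piecewise definition for t = |beta|:
      t <= lam          : lam * t                       (lasso-like near zero)
      lam < t <= a*lam  : (2*a*lam*t - t^2 - lam^2) / (2*(a - 1))
      t > a*lam         : lam^2 * (a + 1) / 2           (constant, so large
                                                         coefficients are not
                                                         penalized further)
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if a <= 2:
        raise ValueError("a must be greater than 2")
    t = abs(beta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


# The penalty grows linearly for small coefficients and then levels off.
for value in (0.5, 1.0, 2.0, 3.7, 10.0):
    print(value, scad_penalty(value, lam=1.0))
```

The three pieces join continuously at `lam` and `a * lam`. Because the penalty is flat beyond `a * lam`, large coefficients are not shrunk further, which is the usual reason for preferring SCAD over the lasso in variable selection.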