Abstract: When integrating data from multiple sources, a common challenge is block-wise missing data. Most existing methods address this issue only in cross-sectional studies. In this paper, we propose a method for variable selection when combining datasets from multiple sources in longitudinal studies. To account for block-wise missingness in covariates, we impute the missing values multiple times based on combinations of samples from different missing patterns and predictors from different data sources. We then use these imputed data to construct estimating equations, and aggregate the information across subjects and sources with the generalized method of moments. We employ the smoothly clipped absolute deviation penalty in variable selection and use the extended Bayesian information criterion for tuning parameter selection. We establish the asymptotic properties of the proposed estimator, and demonstrate the superior performance of the proposed method through numerical experiments. Furthermore, we apply the proposed method to the Alzheimer's Disease Neuroimaging Initiative study to identify sensitive early-stage biomarkers of Alzheimer's Disease, which is crucial for early disease detection and personalized treatment.
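
To make the term "block-wise missing" concrete, the following is a minimal sketch that groups subjects by which data sources are fully observed for them. The source names, column layout, toy values, and the `missing_patterns` helper are all hypothetical and are not taken from the paper; the sketch also uses a single time point, whereas the paper deals with longitudinal measurements.

```python
import numpy as np

# Hypothetical layout: three data sources contributing 2, 2, and 1 covariates.
source_columns = {"source_1": [0, 1], "source_2": [2, 3], "source_3": [4]}

# Toy covariate matrix for six subjects; np.nan marks a missing value.
# Values are made up; whole source blocks are missing at once.
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.5, 2.5, np.nan, np.nan, 5.5],
    [0.5, 1.0, np.nan, np.nan, 4.0],
    [2.0, 3.0, 1.0, 2.0, np.nan],
    [np.nan, np.nan, 2.0, 1.0, 3.0],
    [1.2, 0.8, 0.3, 0.9, np.nan],
])


def missing_patterns(X, source_columns):
    """Group subject indices by the set of sources that are fully observed."""
    groups = {}
    for i, row in enumerate(X):
        observed = tuple(
            name
            for name, cols in source_columns.items()
            if not np.isnan(row[cols]).any()
        )
        groups.setdefault(observed, []).append(i)
    return groups


patterns = missing_patterns(X, source_columns)
for observed, subjects in patterns.items():
    print(observed, subjects)
```

Each key of `patterns` is one missing pattern. An imputation scheme of the kind the abstract describes would draw on subjects from patterns where a given source is observed in order to fill in that source for subjects where it is not.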

What problem does this paper attempt to address?

The paper addresses block-wise missing data that arise when integrating multiple longitudinal study datasets. Most existing methods target block-wise missing data in cross-sectional studies. In longitudinal studies the missing pattern is more complex, because each subject's data may be partially missing at different time points. The paper proposes a variable selection method based on multiple imputation to handle block-wise missing data from multiple data sources.

### Main Issues:

1. **Handling Block-Wise Missing Data**: How to effectively handle block-wise missing data when integrating multiple longitudinal study datasets, especially when the missing proportion is high and the number of covariates is large.
2. **Variable Selection**: How to perform effective variable selection in the presence of block-wise missing data to identify early biomarkers related to Alzheimer's Disease (AD).

### Background:

- **Multi-Source Data**: Multi-source data is receiving increasing attention in modern scientific research, but such data often have block-wise missing values.
- **Alzheimer's Disease Neuroimaging Initiative (ADNI)**: The ADNI dataset contains a large amount of block-wise missing data, particularly in data from cognitively normal (NC) subjects, subjects with mild cognitive impairment (MCI), and Alzheimer's Disease patients.
- **Limitations of Existing Methods**: Traditional methods such as complete case analysis and maximum likelihood methods are inefficient when the proportion of missing data is high and cannot handle multiple missing patterns.

### Solution:

- **Multiple Imputation**: Impute missing values multiple times, using combinations of samples from different missing patterns and predictors from different data sources.
- **Estimating Equations**: Construct estimating equations based on the imputed data and use the Generalized Method of Moments (GMM) to aggregate information across subjects and sources.
- **Variable Selection**: Apply the Smoothly Clipped Absolute Deviation (SCAD) penalty for variable selection and use the Extended Bayesian Information Criterion (EBIC) to select tuning parameters.

### Application:

- **Alzheimer's Disease Research**: Apply the proposed method to the ADNI dataset to identify early biomarkers of Alzheimer's Disease, which is significant for early disease detection and personalized treatment.

### Summary:

The paper proposes a new variable selection method based on multiple imputation to address block-wise missing data when integrating multiple longitudinal study datasets. Simulation experiments and the ADNI application show that the method is effective and performs better than existing approaches when the proportion of missing data is high and the number of covariates is large.
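
The SCAD penalty named in the summary can be sketched as follows. This is the standard three-piece form from the general literature (linear near zero, quadratic in between, constant for large coefficients), not code from the paper. The function name `scad_penalty` and the default `a = 3.7` are assumptions chosen for illustration; the paper's own tuning choices are not given here.

```python
import numpy as np


def scad_penalty(theta, lam, a=3.7):
    """Elementwise SCAD penalty for coefficient(s) theta.

    lam is the tuning parameter (lam > 0) and a > 2 controls where the
    penalty flattens out. a = 3.7 is a commonly used default, assumed here.
    """
    t = np.abs(np.asarray(theta, dtype=float))
    # |theta| <= lam: behaves like the lasso penalty.
    linear = lam * t
    # lam < |theta| <= a * lam: quadratic transition.
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    # |theta| > a * lam: constant, so large coefficients are not shrunk further.
    constant = lam**2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))


# Evaluate the penalty on a small, a medium, and a large coefficient.
values = scad_penalty([0.5, 2.0, 10.0], lam=1.0)
print(values)
```

Because the penalty is constant beyond `a * lam`, large coefficients are left essentially unpenalized while small ones are shrunk toward zero, which is the property that makes SCAD attractive for variable selection. In the method summarized above, `lam` would be chosen by EBIC; that step is not sketched here because its exact form in the paper is not stated.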

Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies
