Abstract: The Health and Aging Brain Study–Health Disparities (HABS–HD) project seeks to understand the biological, social, and environmental factors that affect brain aging among diverse communities. Missing data are a common problem in HABS–HD. Most machine learning (ML) algorithms cannot be trained directly on data that contain missing values, and their accuracy suffers when missingness is handled poorly. Developing a new imputation methodology has therefore become an urgent task for HABS–HD. The three missing-data mechanisms, (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), each call for a different imputation approach. Several popular methods for handling missing data, including listwise deletion, minimum-value imputation, mean imputation, predictive mean matching (PMM), classification and regression trees (CART), and missForest, may produce biased results and reduce statistical power in downstream analyses, such as testing hypotheses about clinical variables or using machine learning to predict Alzheimer's disease (AD) or mild cognitive impairment (MCI). Moreover, these commonly used techniques can yield unreliable estimates of missing values if they ignore the missingness mechanism or are inconsistent with the mechanism actually present in HABS–HD. We therefore proposed a three-step workflow for handling missing data in HABS–HD: (1) missing data evaluation, (2) imputation, and (3) imputation evaluation. First, we explored the missingness in HABS–HD. Then, we developed a machine learning-based multiple imputation method (MLMI) for imputing missing values. We built four ML-based imputation models (support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and lasso and elastic-net regularized generalized linear model (GLMNET)) and adapted them to multiple imputation by simple averaging. Lastly, we evaluated MLMI and compared it with other common methods.
Our results showed that the three-step workflow handled missing values in HABS–HD well and that the ML-based multiple imputation method outperformed other common methods in terms of prediction performance and of changes in distribution and correlation. The choice of missing-data handling method has a significant impact on the accompanying statistical analyses of HABS–HD. The conceptual three-step workflow and the ML-based multiple imputation method perform well for our Alzheimer's disease models and may also be applied to analyses of other disease data.
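
The three-step workflow and the "simple averaging" idea can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the algorithmic details, so everything here is an assumption made for illustration. A one-variable least-squares fit stands in for the SVM, RF, XGB, and GLMNET learners, the data are synthetic with MCAR missingness, and the names `fit_line` and `averaged_multiple_imputation` are hypothetical. Each of the `m` imputations refits the model on a bootstrap resample of the observed rows, and the `m` predictions for each missing value are averaged.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for an ML learner)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def averaged_multiple_imputation(xs, ys, m=20, seed=0):
    """Impute None entries of ys by averaging m bootstrap-model predictions."""
    rng = random.Random(seed)
    observed = [i for i, y in enumerate(ys) if y is not None]
    missing = [i for i, y in enumerate(ys) if y is None]
    sums = {i: 0.0 for i in missing}
    for _ in range(m):
        # Resample observed rows with replacement and refit the model.
        boot = [rng.choice(observed) for _ in observed]
        a, b = fit_line([xs[i] for i in boot], [ys[i] for i in boot])
        for i in missing:
            sums[i] += a + b * xs[i]
    filled = list(ys)
    for i in missing:
        filled[i] = sums[i] / m  # simple average over the m imputations
    return filled


# Synthetic data (assumed): y depends linearly on x, plus noise.
rng = random.Random(42)
n = 200
xs = [rng.uniform(0, 10) for _ in range(n)]
truth = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]
# MCAR: every y value has the same 30% chance of being masked.
ys = [None if rng.random() < 0.3 else y for y in truth]

# Step 1: missing data evaluation.
missing_idx = [i for i, y in enumerate(ys) if y is None]
print(f"missing fraction: {len(missing_idx) / n:.2f}")

# Step 2: imputation.
imputed = averaged_multiple_imputation(xs, ys)

# Step 3: imputation evaluation against the held-back truth.
def rmse(pred):
    errs = [(p - truth[i]) ** 2 for p, i in zip(pred, missing_idx)]
    return (sum(errs) / len(errs)) ** 0.5

obs_mean = statistics.fmean(y for y in ys if y is not None)
rmse_mlmi = rmse([imputed[i] for i in missing_idx])
rmse_mean = rmse([obs_mean] * len(missing_idx))
print(f"RMSE averaged MI: {rmse_mlmi:.3f}  RMSE mean imputation: {rmse_mean:.3f}")
```

Because the masked values are known in this simulation, the evaluation step can compare imputed values with the truth directly. On real cohort data that comparison is only possible for values that are masked artificially, which is one reason to evaluate changes in distribution and correlation as well.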
