Abstract: Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we used data from a real-world breast cancer survival study. A simulated dataset with 30% Missing At Random (MAR) values was used. Single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics were Gower's distance, estimation bias, empirical standard error, coverage rate, confidence interval length, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. In terms of Gower's distance, CART and missForest were the most accurate, while missMDA and CART excelled for binary covariates and missForest and miceCART were superior for continuous covariates. When assessing bias and accuracy in regression estimates, miceCART and miceRF exhibited the least bias. Overall, the imputation methods were more efficient than complete-case analysis (CCA), with MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores.
The study found that SI methods, despite offering better predictive accuracy, introduced more bias into the regression coefficients than MI methods. This underlines the importance of selecting imputation methods according to study goals and data types in time-to-event research. The varying effectiveness of the methods across the performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
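
As a rough illustration of two of the model-free metrics named in the abstract, the sketch below computes NRMSE for continuous values and PFC for categorical values over entries that were artificially set to missing. The abstract does not give formulas, so the definitions here are an assumption: they follow the convention commonly associated with missForest, where the mean squared error is divided by the variance of the true values. The function names and toy data are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import pvariance


def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over the masked entries of a continuous variable.

    Assumed definition: sqrt(mean squared error / variance of true values).
    """
    if len(true_vals) != len(imputed_vals) or not true_vals:
        raise ValueError("inputs must be non-empty and of equal length")
    var = pvariance(true_vals)
    if var == 0:
        raise ValueError("true values have zero variance; NRMSE is undefined")
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / len(true_vals)
    return sqrt(mse / var)


def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified masked entries of a categorical variable."""
    if len(true_labels) != len(imputed_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Toy values standing in for entries that were masked and then imputed.
true_cont = [1.0, 2.0, 3.0, 4.0]
mean_imputed = [2.5, 2.5, 2.5, 2.5]   # imputing the mean everywhere
true_cat = ["a", "b", "a", "b"]
imputed_cat = ["a", "a", "a", "b"]    # one of four labels is wrong

nrmse_mean = nrmse(true_cont, mean_imputed)
pfc_value = pfc(true_cat, imputed_cat)
```

Under this definition, imputing the mean of the true values gives an NRMSE of 1, so values below 1 indicate an imputation that does better than that baseline. For both metrics, lower is better and 0 means the masked entries were recovered exactly.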

Evaluations on Several Imputation Approaches of Integrated Omics Data

Evaluating Imputation Methods for Single-Cell RNA-seq Data

A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative

Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

Evaluation of different approaches for missing data imputation on features associated to genomic data

Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

GSimp: A Gibbs Sampler Based Left-Censored Missing Value Imputation Approach for Metabolomics Studies

Multi-metric comparison of machine learning imputation methods with application to breast cancer survival

Missing Data Imputation: Focusing on Single Imputation.

TOBMI: Trans-omics block missing data imputation using a k-Nearest Neighbor weighted approach.

Impact of machine learning-based imputation techniques on medical datasets- a comparative analysis

A systematic evaluation of single-cell RNA-sequencing imputation methods

Imputation techniques on missing values in breast cancer treatment and fertility data

Coupling Deep Imputation with Multitask Learning for Downstream Tasks on Genomics Data

Evaluating the state of the art in missing data imputation for clinical data

Fast matrix completion in epigenetic methylation studies with informative covariates

Missing Value Estimation Algorithms on Cluster and Representativeness Preservation of Gene Expression Microarray Data

Imputation methods for mixed datasets in bioarchaeology

CHOOSING APPROPRIATE IMPUTATION METHODS FOR MISSING DATA: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets

Biotransformations catalyzed by multimeric enzymes: stabilization of tetrameric ampicillin acylase permits the optimization of ampicillin synthesis under dissociation conditions.