Missing Values and Imputation in Healthcare Data: Can Interpretable Machine Learning Help?

Zhi Chen, Sarah Tan, Urszula Chajewska, Cynthia Rudin, Rich Caruana

2023-04-24

Abstract: Missing values are a fundamental problem in data science. Many datasets have missing values that must be properly handled because the way missing values are treated can have a large impact on the resulting machine learning model. In medical applications, the consequences may affect healthcare decisions. There are many methods in the literature for dealing with missing values, including state-of-the-art methods which often depend on black-box models for imputation. In this work, we show how recent advances in interpretable machine learning provide a new perspective for understanding and tackling the missing value problem. We propose methods based on high-accuracy glass-box Explainable Boosting Machines (EBMs) that can help users (1) gain new insights on missingness mechanisms and better understand the causes of missingness, and (2) detect -- or even alleviate -- potential risks introduced by imputation algorithms. Experiments on real-world medical datasets illustrate the effectiveness of the proposed methods.

Machine Learning

What problem does this paper attempt to address?

The paper explores how interpretable machine learning can address missing values in healthcare data. It proposes methods built on Explainable Boosting Machines (EBMs), a high-accuracy glass-box model, to understand and handle these missing values. The core issues the paper aims to address are:

1. **Understanding the causes of missing values**: The interpretability of EBMs helps users gain insight into the mechanisms that lead to missing values, and so better understand the reasons behind them.
2. **Detecting and mitigating risks introduced by imputation algorithms**: In medical applications, the way missing values are handled can have a large impact on the resulting machine learning model, which in turn may affect medical decisions. The researchers therefore propose EBM-based methods to detect, and in some cases mitigate, potential risks introduced by different imputation methods.

Specifically, the solutions proposed in the paper include:

- **Testing Missing Completely at Random (MCAR)**: Using EBM shape functions to test whether the missing values conform to the MCAR assumption, and proposing a new statistical test to evaluate this.
- **Handling missing-assumed-normal values**: For some features, such as lab test results, a doctor who believes a patient is "normal" on that measure might not order the test, so the value is missing because it is assumed normal. The paper demonstrates how to use EBMs to identify this situation and proposes a model editing method to correct the issues arising from it.
- **Predicting patterns of missingness**: Training an EBM to predict whether a variable is missing can provide insight into the mechanisms of missingness. The paper showcases interpretable results obtained this way, revealing complex patterns of missingness in healthcare data.

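
The intuition behind checking the MCAR assumption can be sketched with a simple permutation test. This is not the statistical test proposed in the paper, which is built on EBM shape functions; it is a swapped-in, stdlib-only illustration that asks one narrower question: does the outcome rate differ between rows where a feature is missing and rows where it is observed? The function names (`outcome_rate_gap`, `permutation_p_value`), the synthetic data, and the missingness probabilities are all assumptions made for illustration.

```python
import random

def outcome_rate_gap(values, outcomes):
    """Outcome rate where the feature is missing minus the rate where it is observed.

    Assumes both groups are non-empty; missing entries are encoded as None.
    """
    missing = [y for v, y in zip(values, outcomes) if v is None]
    observed = [y for v, y in zip(values, outcomes) if v is not None]
    return sum(missing) / len(missing) - sum(observed) / len(observed)

def permutation_p_value(values, outcomes, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the gap computed above.

    Shuffling the outcomes breaks any link between missingness and outcome,
    which is what we would expect to see if the feature were MCAR.
    """
    rng = random.Random(seed)
    actual = abs(outcome_rate_gap(values, outcomes))
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(outcome_rate_gap(values, shuffled)) >= actual:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic, hypothetical data: the lab test is skipped far more often
# for patients without the adverse outcome (y == 0), so it is NOT MCAR.
rng = random.Random(42)
outcomes = [1 if rng.random() < 0.3 else 0 for _ in range(400)]
lab_values = [
    None if rng.random() < (0.1 if y == 1 else 0.6) else rng.gauss(1.0, 0.2)
    for y in outcomes
]

gap = outcome_rate_gap(lab_values, outcomes)
p_value = permutation_p_value(lab_values, outcomes)
print(f"gap={gap:.3f}, p={p_value:.4f}")
```

A small p-value is evidence against MCAR for this feature. A large one does not establish MCAR, since missingness could still depend on other variables that this check never looks at.
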
In summary, this paper aims to leverage interpretable machine learning techniques to provide tools for data scientists and researchers in the healthcare field, enabling them to better understand and handle the issue of missing values in datasets.
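
The idea of predicting a variable's missingness from other features can be illustrated with a much simpler stand-in than an EBM: a binned missingness-rate profile for a single predictor. An EBM would learn one such curve per feature jointly; the sketch below only tabulates the raw rate for one feature, so it is a rough analogue of a single shape function rather than the paper's method. The helper name `missingness_rate_by_bin`, the age bins, and the synthetic missingness probabilities are hypothetical.

```python
import random

def missingness_rate_by_bin(predictor, target, edges):
    """Fraction of rows with a missing target (None) within each predictor bin.

    Bin i covers edges[i] <= x < edges[i + 1]; empty bins yield NaN.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    missing = [0] * n_bins
    for x, t in zip(predictor, target):
        for i in range(n_bins):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                missing[i] += 1 if t is None else 0
                break
    return [m / c if c else float("nan") for m, c in zip(missing, counts)]

def skip_probability(age):
    """Hypothetical mechanism: the lab test is skipped more often for younger patients."""
    if age < 40:
        return 0.8
    if age < 65:
        return 0.5
    return 0.2

# Synthetic, hypothetical cohort with integer ages from 20 to 89.
rng = random.Random(7)
ages = [rng.randrange(20, 90) for _ in range(3000)]
lab = [None if rng.random() < skip_probability(a) else rng.gauss(1.0, 0.2) for a in ages]

edges = [20, 40, 65, 90]
rates = missingness_rate_by_bin(ages, lab, edges)
for lo, hi, r in zip(edges[:-1], edges[1:], rates):
    print(f"age {lo}-{hi - 1}: missing rate {r:.2f}")
```

A profile that varies strongly across bins, as it does here by construction, suggests that missingness depends on that feature and is worth examining before choosing an imputation method.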

On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets

Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques

Missing Data Exploration: Highlighting Graphical Presentation of Missing Pattern.

Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study

Benchmarking missing-values approaches for predictive models on health databases

Explainability of Machine Learning Models under Missing Data

Impact of machine learning-based imputation techniques on medical datasets- a comparative analysis

Missing Values in Big Data Research: Some Basic Skills

Handling the Missing Data Problem in Electronic Health Records for Cancer Prediction.

Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records

Impact of Missing Values in Machine Learning: A Comprehensive Analysis

A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study–Health Disparities

Mining for equitable health: Assessing the impact of missing data in electronic health records

Evaluating the state of the art in missing data imputation for clinical data

Missing value imputation using unsupervised machine learning techniques

Attention-based Imputation of Missing Values in Electronic Health Records Tabular Data

Interpretability of machine learning‐based prediction models in healthcare

Moving Beyond Medical Statistics: A Systematic Review on Missing Data Handling in Electronic Health Records

Imputation techniques on missing values in breast cancer treatment and fertility data

Imputation of missing values for electronic health record laboratory data