Missing Values and Imputation in Healthcare Data: Can Interpretable Machine Learning Help?

Zhi Chen, Sarah Tan, Urszula Chajewska, Cynthia Rudin, Rich Caruana
2023-04-24
Abstract: Missing values are a fundamental problem in data science. Many datasets have missing values that must be properly handled because the way missing values are treated can have a large impact on the resulting machine learning model. In medical applications, the consequences may affect healthcare decisions. There are many methods in the literature for dealing with missing values, including state-of-the-art methods which often depend on black-box models for imputation. In this work, we show how recent advances in interpretable machine learning provide a new perspective for understanding and tackling the missing value problem. We propose methods based on high-accuracy glass-box Explainable Boosting Machines (EBMs) that can help users (1) gain new insights on missingness mechanisms and better understand the causes of missingness, and (2) detect -- or even alleviate -- potential risks introduced by imputation algorithms. Experiments on real-world medical datasets illustrate the effectiveness of the proposed methods.
Machine Learning
What problem does this paper attempt to address?
The paper explores how interpretable machine learning can address missing values in healthcare data. It builds its methods on Explainable Boosting Machines (EBMs), which are high-accuracy glass-box models, to help users understand and handle missingness. The core issues the paper aims to address are:

1. **Understanding the causes of missing values**: The interpretability of EBMs helps users gain insight into the mechanisms that produce missing values, and therefore into the reasons behind them.
2. **Detecting and mitigating risks introduced by imputation algorithms**: In medical applications, the way missing values are handled can significantly affect the resulting machine learning model, which in turn may affect medical decisions. The researchers therefore propose EBM-based methods to detect, and even mitigate, potential risks introduced by different imputation methods.

Specifically, the solutions proposed in the paper include:

- **Testing for Missing Completely at Random (MCAR)**: Using EBM shape functions to test whether missing values conform to the MCAR assumption, and proposing a new statistical test to evaluate this.
- **Handling missing-assumed-normal values**: For certain features (such as lab test results), doctors might not order the test if they believe the patient is "normal" on that metric. The paper demonstrates how to use EBMs to identify this situation and proposes a model-editing method to correct the problems it causes.
- **Predicting patterns of missingness**: Training an EBM to predict whether a variable is missing can give insight into the missingness mechanism. The paper presents interpretable results obtained this way, revealing complex patterns of missingness in healthcare data.
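The intuition behind an MCAR check can be illustrated with a far simpler stand-in than the paper's EBM-based test. Under MCAR, whether a value is missing is unrelated to everything else, including the outcome, so the outcome rate should look about the same in rows where the feature is missing and rows where it is observed. The sketch below uses a classical two-proportion z-test for that comparison. It is not the statistical test proposed in the paper, and the names `missingness_outcome_test` and `simulate`, along with all the simulation parameters, are hypothetical choices made only for this illustration.

```python
import math
import random


def missingness_outcome_test(values, labels):
    """Two-proportion z-test comparing the positive-outcome rate between
    rows where the feature is missing (None) and rows where it is observed.

    Returns (z, two_sided_p_value). A small p-value is evidence against
    MCAR; a large p-value does not prove that MCAR holds.
    """
    missing = [y for v, y in zip(values, labels) if v is None]
    observed = [y for v, y in zip(values, labels) if v is not None]
    n_miss, n_obs = len(missing), len(observed)
    if n_miss == 0 or n_obs == 0:
        raise ValueError("need at least one missing and one observed row")

    rate_miss = sum(missing) / n_miss
    rate_obs = sum(observed) / n_obs
    pooled = (sum(missing) + sum(observed)) / (n_miss + n_obs)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_miss + 1 / n_obs))
    if std_err == 0:  # all labels identical: no detectable difference
        return 0.0, 1.0

    z = (rate_miss - rate_obs) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


def simulate(n, informative, seed=0):
    """Synthetic lab feature (illustrative parameters, not real data).

    With informative=True the test is skipped far more often for patients
    with a negative outcome, mimicking a missing-assumed-normal pattern.
    With informative=False values go missing at a constant rate (MCAR).
    """
    rng = random.Random(seed)
    values, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0
        lab = rng.gauss(1.0 if y else 0.0, 1.0)
        if informative:
            p_missing = 0.1 if y else 0.6
        else:
            p_missing = 0.4
        values.append(None if rng.random() < p_missing else lab)
        labels.append(y)
    return values, labels


if __name__ == "__main__":
    for informative in (False, True):
        z, p = missingness_outcome_test(*simulate(5000, informative))
        print(f"informative={informative}: z={z:.2f}, p={p:.3g}")
```

A check like this compares only two group averages. The approach described above instead reads the missing-value term off a fitted EBM's shape function, which also accounts for the other features in the model, so treat this snippet as a way to build intuition and not as a replacement.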
In summary, the paper uses interpretable machine learning to give data scientists and healthcare researchers tools for better understanding and handling missing values in their datasets.