Machine Learning for Structured Clinical Data

Brett K. Beaulieu-Jones
DOI: https://doi.org/10.48550/arXiv.1707.06997
2017-07-21
Abstract: Research is a tertiary priority in the EHR, where the priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at a research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases but many machine learning algorithms require a snapshot, at a single time point, to create a usable vector form. Furthermore, algorithms that produce black box results do not provide the interpretability required for clinical adoption. This chapter discusses these challenges and others in applying machine learning techniques to the structured EHR (i.e. Patient Demographics, Family History, Medication Information, Vital Signs, Laboratory Tests, Genetic Testing). It does not cover feature extraction from additional sources such as imaging data or free text patient notes but the approaches discussed can include features extracted from these sources.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenges of applying machine learning techniques to structured electronic health records (EHRs). Specifically:

1. **Data standardization and formatting**
   - Because the EHR exists primarily for patient care and billing, its data is not standardized or formatted in a way that suits machine learning.
   - Data may be missing for many reasons, from individual input styles to differences in clinical decision making (e.g., which laboratory tests to order).
2. **Sample size and the gold standard**
   - Few patients are annotated at research quality, which limits sample size and makes the gold standard a moving target.
   - Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form.
3. **Interpretability**
   - Algorithms that produce "black box" results do not provide the interpretability required for clinical adoption.
4. **Longitudinal data modeling and semi-supervised learning**
   - The characteristics of EHR data make unsupervised clustering and semi-supervised classification key tasks.
5. **Privacy, reproducibility, and data sharing**
   - Secondary analysis of EHR data must protect patient privacy and minimize the risk of re-identification.
   - Techniques such as differential privacy can balance data utility against privacy protection.
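To make the differential privacy point in item 5 more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not the method used in the paper: the function names `laplace_noise` and `private_count`, the record layout, and the choice of a counting query are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one patient
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical toy cohort: each record is a dict with a single flag.
    cohort = [{"diabetic": i % 4 == 0} for i in range(200)]
    noisy = private_count(cohort, lambda r: r["diabetic"], epsilon=0.5,
                          rng=random.Random(0))
    print("noisy count of diabetic patients:", round(noisy, 2))
```

The trade-off mentioned in item 5 shows up in `epsilon`: a larger value gives a count closer to the truth, while a smaller value hides any single patient's contribution more thoroughly.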
### Formula representation

The mechanisms behind missing data can be stated formally. Let \(Y_{\text{obs}}\) denote the observed data and \(Y_{\text{mis}}\) the unobserved data.

- **Missing Completely at Random (MCAR)**: missingness is independent of both the observed and the unobserved data:
  \[ P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(\text{missing}) \]
- **Missing at Random (MAR)**: missingness depends only on the observed data:
  \[ P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(\text{missing} \mid Y_{\text{obs}}) \]
- **Missing Not at Random (MNAR)**: missingness depends on the unobserved data, even after conditioning on the observed data:
  \[ P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(\text{missing} \mid Y_{\text{obs}}) \]

These definitions distinguish the types of missing data and clarify how each affects subsequent analysis.

### Summary

The paper discusses how to address the challenges above and describes several ways to handle missing data, such as feature selection, algorithm modification, and data imputation. It also emphasizes the importance of protecting data privacy when applying machine learning techniques.
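The three missingness mechanisms defined under "Formula representation" can be made concrete with a small simulation. The sketch below is illustrative only. The cohort is synthetic, and the names `simulate_missingness` and `make_cohort`, the age threshold of 65, and the median cutoff are assumptions chosen for the example rather than anything taken from the paper. Age stands in for \(Y_{\text{obs}}\) and a lab value stands in for \(Y_{\text{mis}}\).

```python
import random


def simulate_missingness(ages, labs, mechanism, rate=0.3, seed=0):
    """Return a copy of `labs` with some entries replaced by None.

    ages: fully observed covariate (plays the role of Y_obs)
    labs: values subject to missingness (plays the role of Y_mis)
    """
    rng = random.Random(seed)
    cutoff = sorted(labs)[len(labs) // 2]  # median-like cutoff, used by MNAR only
    masked = []
    for age, lab in zip(ages, labs):
        if mechanism == "MCAR":
            # Same probability for everyone, regardless of any data.
            p = rate
        elif mechanism == "MAR":
            # Depends only on the observed covariate (age).
            p = 2 * rate if age >= 65 else 0.0
        elif mechanism == "MNAR":
            # Depends on the value that ends up missing.
            p = 2 * rate if lab > cutoff else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism!r}")
        masked.append(None if rng.random() < p else lab)
    return masked


def make_cohort(n=1000, seed=42):
    """Build a synthetic cohort of ages and lab values."""
    rng = random.Random(seed)
    ages = [rng.randint(20, 90) for _ in range(n)]
    labs = [rng.gauss(100.0, 15.0) for _ in range(n)]
    return ages, labs


if __name__ == "__main__":
    ages, labs = make_cohort()
    for mechanism in ("MCAR", "MAR", "MNAR"):
        masked = simulate_missingness(ages, labs, mechanism)
        n_missing = sum(value is None for value in masked)
        print(mechanism, "missing:", n_missing)
```

In this toy setup, the mean of the remaining lab values stays roughly unbiased under MCAR, whereas under MNAR it is pulled downward because high values are preferentially dropped. This is why the choice of handling strategy depends on the mechanism.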