Machine Learning for Structured Clinical Data

Brett K. Beaulieu-Jones

DOI: https://doi.org/10.48550/arXiv.1707.06997

2017-07-21

Abstract: Research is a tertiary priority in the EHR, where the priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at a research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases but many machine learning algorithms require a snapshot, at a single time point, to create a usable vector form. Furthermore, algorithms that produce black box results do not provide the interpretability required for clinical adoption. This chapter discusses these challenges and others in applying machine learning techniques to the structured EHR (i.e. Patient Demographics, Family History, Medication Information, Vital Signs, Laboratory Tests, Genetic Testing). It does not cover feature extraction from additional sources such as imaging data or free text patient notes but the approaches discussed can include features extracted from these sources.

Machine Learning

What problem does this paper attempt to address?

This paper addresses the challenges that arise when applying machine learning techniques to structured electronic health records (EHR). Specifically:

1. **Data standardization and formatting**:
   - Because the EHR exists primarily for patient care and billing, its data is not standardized or formatted in a way suited to machine learning.
   - Data may be missing for many reasons, ranging from individual input styles to differences in clinical decision making (e.g., which laboratory tests to order).
2. **Sample size and the gold standard**:
   - Few patients are annotated at research quality, which limits sample size and presents a moving gold standard.
   - Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form.
3. **Interpretability**:
   - Many machine learning algorithms produce "black box" results and lack the interpretability required for clinical adoption.
4. **Longitudinal data modeling and semi-supervised learning**:
   - The characteristics of EHR data make unsupervised clustering and semi-supervised classification key tasks.
5. **Privacy, reproducibility, and data sharing**:
   - Secondary analysis of EHR data must protect patient privacy and minimize the risk of re-identification.
   - Techniques such as differential privacy can balance data utility against privacy protection.
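As a rough illustration of the differential privacy idea in item 5, the sketch below releases a noisy patient count using the Laplace mechanism. This is a standard textbook technique chosen here for illustration; it is not necessarily the approach the chapter itself uses. The function names (`laplace_noise`, `private_count`), the toy cohort, and the `epsilon` values are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching `predicate` (Laplace mechanism).

    Adding or removing one patient changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy cohort: every fourth patient carries the diagnosis.
    cohort = [{"diabetic": i % 4 == 0} for i in range(1000)]
    for eps in (0.1, 1.0):
        noisy = private_count(cohort, lambda r: r["diabetic"], eps, rng)
        print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

A smaller `epsilon` gives stronger privacy but a noisier count, which is the utility versus privacy trade-off the summary refers to.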
### Formula representation

The different kinds of missing data can be described formally, where \(Y_{\text{obs}}\) denotes the observed data and \(Y_{\text{mis}}\) the unobserved data:

- **Missing Completely at Random (MCAR)**: missingness is independent of both the observed and the unobserved data:
  \[ P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(\text{missing}) \]
- **Missing at Random (MAR)**: missingness depends only on the observed data:
  \[ P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(\text{missing} \mid Y_{\text{obs}}) \]
- **Missing Not at Random (MNAR)**: missingness depends on the unobserved data, even after conditioning on the observed data:
  \[ P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(\text{missing} \mid Y_{\text{obs}}) \]

These definitions distinguish the types of missing data and clarify how each affects subsequent analysis.

### Summary

The paper discusses how to address these challenges and proposes several approaches for handling missing data, such as feature selection, algorithm modification, and data imputation. It also emphasizes the importance of protecting data privacy when applying machine learning techniques.
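To make the MCAR, MAR, and MNAR definitions above concrete, the following is a minimal simulation sketch. It generates a synthetic, always-observed covariate (`age`) and a lab value that may be missing, then hides the lab value under each mechanism. The variable names, the linear relationship, and the missingness probabilities are illustrative assumptions, not values from the paper.

```python
import math
import random
from statistics import mean


def sigmoid(x):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_missingness(n=20000, seed=0):
    """Return the full-data lab mean and the observed lab mean per mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 90)                # always observed (Y_obs)
        lab = 50 + 0.5 * age + rng.gauss(0, 5)   # may be missing (Y_mis)
        rows.append((age, lab))

    observed = {"MCAR": [], "MAR": [], "MNAR": []}
    for age, lab in rows:
        # MCAR: a fixed 30% chance of missingness, unrelated to any data.
        if rng.random() >= 0.3:
            observed["MCAR"].append(lab)
        # MAR: missingness depends only on the observed covariate (age).
        if rng.random() >= sigmoid((age - 55) / 10):
            observed["MAR"].append(lab)
        # MNAR: missingness depends on the lab value that would be hidden.
        if rng.random() >= sigmoid((lab - 77.5) / 5):
            observed["MNAR"].append(lab)

    full_mean = mean(lab for _, lab in rows)
    return full_mean, {name: mean(vals) for name, vals in observed.items()}


if __name__ == "__main__":
    full_mean, observed_means = simulate_missingness()
    print(f"full data mean: {full_mean:.2f}")
    for name, value in observed_means.items():
        print(f"{name} observed mean: {value:.2f}")
```

In this sketch the observed mean under MCAR stays close to the full-data mean, while MAR and MNAR both bias it. Under MAR the bias can in principle be corrected by conditioning on the observed covariate, which is why imputation from observed variables can work there. Under MNAR that correction is not available without further assumptions.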
