MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets

Nassim Oufattole, Teya Bergamaschi, Aleksia Kolo, Hyewon Jeong, Hanna Gaggin, Collin M. Stultz, Matthew B.A. McDermott
2024-11-01
Abstract: Effective, reliable, and scalable development of machine learning (ML) solutions for structured electronic health record (EHR) data requires the ability to reliably generate high-quality baseline models for diverse supervised learning tasks in an efficient and performant manner. Historically, producing such baseline models has been a largely manual effort: individual researchers would need to decide on the particular featurization and tabularization processes to apply to their individual raw, longitudinal data, and then train a supervised model over those data to produce a baseline result to compare novel methods against, all for just one task and one dataset. In this work, powered by complementary advances in core data standardization through the MEDS framework, we dramatically simplify and accelerate this process of tabularizing irregularly sampled time-series data, providing researchers the ability to automatically and scalably featurize and tabularize their longitudinal EHR data across tens of thousands of individual features, hundreds of millions of clinical events, and diverse windowing horizons and aggregation strategies, all before ultimately leveraging these tabular data to automatically produce high-caliber XGBoost baselines in a highly computationally efficient manner. This system scales to dramatically larger datasets than tabularization tools currently available to the community and enables researchers with any MEDS format dataset to immediately begin producing reliable and performant baseline prediction results on various tasks, with minimal human effort required. This system will greatly enhance the reliability, reproducibility, and ease of development of powerful ML solutions for health problems across diverse datasets and clinical settings.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to generate high-quality baseline models for structured electronic health record (EHR) data efficiently, reliably, and scalably across multiple supervised learning tasks. The authors point out that researchers currently have to hand-build the pipeline from raw longitudinal EHR data through featurization and tabularization, and then train a supervised model to produce a baseline result. This process is time-consuming and unstandardized, which makes baselines hard to compare across studies and undermines the reliability and reproducibility of machine learning in the medical field. MEDS-Tab simplifies and accelerates this process in three ways:

1. **Automated tabularization**: MEDS-Tab automatically converts irregularly sampled time-series data into a tabular format suitable for decision-tree models (such as XGBoost). It featurizes and tabularizes tens of thousands of individual features and hundreds of millions of clinical events across diverse windowing horizons and aggregation strategies.
2. **Efficient baseline model generation**: MEDS-Tab uses AutoML tools to automatically optimize high-performance tree-based machine-learning methods, quickly producing high-quality baseline models on large-scale medical datasets.
3. **High scalability**: MEDS-Tab handles much larger datasets than existing tabularization tools and applies to any dataset in the MEDS format, so researchers can start generating reliable baseline predictions immediately, with minimal manual intervention.

Overall, MEDS-Tab aims to significantly reduce researchers' workload, improve the reliability and reproducibility of baseline models, and support the development of powerful machine-learning solutions across diverse datasets and clinical settings.
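To make the idea of tabularizing irregularly sampled events concrete, here is a minimal sketch in plain Python. It is not MEDS-Tab's implementation or API: the `tabularize` function, the `code/window/aggregation` feature naming, the toy events, and the choice of look-back windows are all assumptions made for illustration. It only shows the general technique of aggregating long-format events over several look-back windows into one flat feature row per (subject, prediction time).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Toy events in a MEDS-like long format (hypothetical data):
# (subject_id, time, code, numeric_value)
EVENTS = [
    (1, datetime(2024, 1, 1, 8), "HR", 80.0),
    (1, datetime(2024, 1, 1, 20), "HR", 90.0),
    (1, datetime(2024, 1, 2, 12), "LAB//GLUCOSE", 110.0),
    (1, datetime(2024, 1, 3, 8), "HR", 100.0),
    (2, datetime(2024, 1, 2, 9), "HR", 70.0),
]

# Look-back horizons and aggregations (illustrative choices, not MEDS-Tab defaults).
WINDOWS = {"1d": timedelta(days=1), "7d": timedelta(days=7)}
AGGS = {"count": len, "sum": sum, "min": min, "max": max}


def tabularize(events, subject_id, index_time):
    """Build one flat feature row for a subject at a given prediction time.

    Each feature is named "<code>/<window>/<aggregation>". Features with no
    events in the window are simply absent (a sparse row).
    """
    row = {}
    for w_name, width in WINDOWS.items():
        by_code = defaultdict(list)
        for sid, t, code, value in events:
            # Keep events in the half-open window (index_time - width, index_time].
            if sid == subject_id and index_time - width < t <= index_time:
                by_code[code].append(value)
        for code, values in by_code.items():
            for a_name, fn in AGGS.items():
                row[f"{code}/{w_name}/{a_name}"] = fn(values)
    return row


row = tabularize(EVENTS, subject_id=1, index_time=datetime(2024, 1, 3, 8))
for name in sorted(row):
    print(name, row[name])
```

Each row is sparse: a feature appears only when at least one matching event falls inside the window. That suits gradient-boosted tree learners such as XGBoost, which handle missing values natively, and it is one reason a sparse representation is attractive when the feature space runs to tens of thousands of columns.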