Abstract: Effective, reliable, and scalable development of machine learning (ML) solutions for structured electronic health record (EHR) data requires the ability to reliably generate high-quality baseline models for diverse supervised learning tasks in an efficient and performant manner. Historically, producing such baseline models has been a largely manual effort--individual researchers would need to decide on the particular featurization and tabularization processes to apply to their individual raw, longitudinal data; and then train a supervised model over those data to produce a baseline result to compare novel methods against, all for just one task and one dataset. In this work, powered by complementary advances in core data standardization through the MEDS framework, we dramatically simplify and accelerate this process of tabularizing irregularly sampled time-series data, providing researchers the ability to automatically and scalably featurize and tabularize their longitudinal EHR data across tens of thousands of individual features, hundreds of millions of clinical events, and diverse windowing horizons and aggregation strategies, all before ultimately leveraging these tabular data to automatically produce high-caliber XGBoost baselines in a highly computationally efficient manner. This system scales to dramatically larger datasets than tabularization tools currently available to the community and enables researchers with any MEDS format dataset to immediately begin producing reliable and performant baseline prediction results on various tasks, with minimal human effort required. This system will greatly enhance the reliability, reproducibility, and ease of development of powerful ML solutions for health problems across diverse datasets and clinical settings.

What problem does this paper attempt to address?

The paper addresses how to generate high-quality baseline models for structured electronic health record (EHR) data efficiently, reliably, and scalably across multiple supervised learning tasks. Specifically, the authors point out that researchers currently have to decide manually how to featurize and tabularize their raw, longitudinal EHR data, and then train a supervised model to produce a baseline result, repeating the effort for each task and dataset. This process is time-consuming and unstandardized, which makes baseline models hard to compare across studies and undermines the reliability and reproducibility of machine learning in the medical field. MEDS-Tab simplifies and accelerates this process in the following ways:

1. **Automated tabularization**: MEDS-Tab automatically converts irregularly sampled time-series data into a tabular format suitable for decision-tree models such as XGBoost. It featurizes and tabularizes data across tens of thousands of individual features, hundreds of millions of clinical events, and diverse windowing horizons and aggregation strategies.
2. **Efficient baseline model generation**: MEDS-Tab uses AutoML tools to automatically tune high-performing tree-based machine learning methods, quickly producing high-quality baseline models on large-scale medical datasets.
3. **High scalability**: MEDS-Tab handles much larger datasets than existing tabularization tools and applies to any dataset in the MEDS format, so researchers can start generating reliable baseline prediction results immediately with minimal manual intervention.

Overall, MEDS-Tab aims to significantly reduce researchers' workload, improve the reliability and reproducibility of baseline models, and support the development of powerful machine learning solutions across diverse datasets and clinical settings.
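To make the windowed-aggregation idea concrete, here is a minimal sketch in plain Python of how irregularly sampled events might be flattened into one feature row per subject. It is an illustration under stated assumptions, not MEDS-Tab's actual API: the function `tabularize`, the window set, the aggregation set, the `code/window/agg` feature naming, and the toy events are all hypothetical, and the event tuple layout is only loosely modeled on MEDS-style columns (subject, time, code, numeric value).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical look-back windows and value aggregations, for illustration only.
WINDOWS = {"1d": timedelta(days=1), "7d": timedelta(days=7)}
VALUE_AGGS = {"sum": sum, "min": min, "max": max}


def tabularize(events, index_times):
    """Flatten long-format events into one feature dict per subject.

    events: iterable of (subject_id, time, code, numeric_value) tuples;
        numeric_value may be None for code-only events.
    index_times: dict subject_id -> prediction time. Only events strictly
        before that time, and within a look-back window, contribute.
    """
    # Collect the values seen for each (subject, code, window) combination.
    buckets = defaultdict(list)
    for subject_id, time, code, value in events:
        index_time = index_times.get(subject_id)
        if index_time is None or time >= index_time:
            continue  # unknown subject, or event not in the past
        for window_name, width in WINDOWS.items():
            if index_time - time <= width:
                buckets[(subject_id, code, window_name)].append(value)

    rows = {subject_id: {} for subject_id in index_times}
    for (subject_id, code, window_name), values in buckets.items():
        prefix = f"{code}/{window_name}"
        rows[subject_id][f"{prefix}/count"] = len(values)
        numeric = [v for v in values if v is not None]
        if numeric:  # value aggregations only apply to numeric events
            for agg_name, agg in VALUE_AGGS.items():
                rows[subject_id][f"{prefix}/{agg_name}"] = agg(numeric)
    return rows


# Toy data: two subjects, one shared prediction time.
EVENTS = [
    (1, datetime(2024, 1, 1, 8), "HR", 80.0),
    (1, datetime(2024, 1, 6, 8), "HR", 100.0),
    (1, datetime(2024, 1, 6, 20), "DX//SEPSIS", None),
    (1, datetime(2024, 1, 8, 0), "HR", 120.0),  # after index time: ignored
    (2, datetime(2024, 1, 5, 12), "HR", 70.0),
]
INDEX_TIMES = {1: datetime(2024, 1, 7), 2: datetime(2024, 1, 7)}

rows = tabularize(EVENTS, INDEX_TIMES)
for subject_id, features in sorted(rows.items()):
    print(subject_id, dict(sorted(features.items())))
```

Features that a subject never has are simply absent from its row. In a real pipeline these sparse rows would typically be packed into a sparse matrix and passed to a gradient-boosted tree learner such as XGBoost, which can treat absent entries as missing values; that training step is omitted here.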

MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets
