Abstract:

Objective: Electronic health records (EHRs) hold promise for enhancing clinical trial activities, but identifying eligible patients within EHRs remains challenging. Our objective was to develop an eligibility criteria phenotyping pipeline that identifies patients with matching clinical characteristics from EHRs.

Material and methods: We used clinical trial eligibility criteria from ClinicalTrials.gov and patient EHR datasets from the Sema4 data warehouse, which include datasets from multiple health providers. To ensure computability and queryability, the eligibility criteria attributes and the clinical characteristics in EHRs were normalized using four national standard terminologies (LOINC, ICD-9-CM, ICD-10-CM, and CPT) along with four in-house knowledge bases containing procedures, medications, biomarkers, and diagnosis modifiers. Normalization was semi-automated, combining rule-based methods, pattern recognition, and manual annotation. The quality of the machine-normalized criteria attributes was assessed using Cohen's kappa coefficient on randomly selected criteria, and the accuracy of matching between normalized criteria and patient clinical characteristics was evaluated using precision, recall, and F1 score on randomly selected patients.

Results: A total of 640 unique eligibility criteria attributes were identified, covering various medical conditions, including five types of cancer (non-small cell lung cancer, small cell lung cancer, prostate cancer, breast cancer, and multiple myeloma), two autoimmune diseases (ulcerative colitis and Crohn's disease), one metabolic disorder (non-alcoholic steatohepatitis), and one rare disease (sickle cell anemia). Of these attributes, 367 were normalized: 174 were encoded with standard terminologies and 193 were normalized using the in-house reference tables. The agreement between automated and manually annotated normalized codes was 0.82 (Cohen's kappa), and matching between eligibility criteria attributes and patient clinical information achieved an F1 score of 0.94.

Conclusion: We established a clinical phenotyping pipeline that links eligibility criteria to EHR data. Applying the pipeline to EHR data from different institutes demonstrated its generalizability. The pipeline has the potential to increase the use of EHRs in clinical trial activities and to improve patient matching and selection, thereby advancing clinical research and patient outcomes.
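The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy codes, and the patient identifiers are hypothetical, and it assumes that each criterion receives exactly one normalized code from the machine and one from the annotator, and that matching is scored over sets of patient IDs.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 for matched patient IDs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: one code per criterion from each source.
machine_codes = ["C34.90", "C34.90", "C61", "C61"]
manual_codes = ["C34.90", "C34.90", "C61", "C34.90"]
kappa = cohens_kappa(machine_codes, manual_codes)

# Hypothetical patients returned by the pipeline vs. chart-review eligible patients.
prf = precision_recall_f1({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when a few codes dominate the annotated sample.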

Enabling scalable clinical interpretation of ML-based phenotypes using real world data

A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history

A Scalable Workflow to Build Machine Learning Classifiers with Clinician-in-the-Loop to Identify Patients in Specific Diseases

Toward Cross‐platform Electronic Health Record‐driven Phenotyping Using Clinical Quality Language

Visual Cluster Analysis in Support of Clinical Decision Intelligence

Communicating exploratory unsupervised machine learning analysis in age clustering for paediatric disease

Machine learning enabled subgroup analysis with real-world data to inform clinical trial eligibility criteria design

Unsupervised Extraction of Phenotypes from Cancer Clinical Notes for Association Studies

Deep representation learning of electronic health records to unlock patient stratification at scale

Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open Data Analysis Platform

Longitudinal patient stratification of electronic health records with flexible adjustment for clinical outcomes

Automating Construction of Machine Learning Models With Clinical Big Data: Proposal Rationale and Methods

Stratification of Alzheimer's Disease Patients Using Knowledge-Guided Unsupervised Latent Factor Clustering with Electronic Health Record Data

Approach to machine learning for extraction of real-world data variables from electronic health records

Optimizing Patient Stratification in Healthcare: A Comparative Analysis of Clustering Algorithms for EHR Data

EHR-ML: A generalisable pipeline for reproducible clinical outcomes using electronic health records

Towards Structuring Real-World Data at Scale: Deep Learning for Extracting Key Oncology Information from Clinical Text with Patient-Level Supervision

Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

Scalable incident detection via natural language processing and probabilistic language models

Language-model-based patient embedding using electronic health records facilitates phenotyping, disease forecasting, and progression analysis

Establishing the Automatic Identification of Clinical Trial Cohorts from Electronic Health Records by Matching Normalized Eligibility Criteria and Patient Clinical Characteristics