Abstract:Objective: To develop a natural language processing (NLP) algorithm that can accurately extract headache frequency from free-text clinical notes. Background: Headache frequency, defined as the number of days with any headache in a month (or 4 weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, due to the variations and inconsistencies in documentation by clinicians, significant challenges exist to accurately extract headache frequency from the electronic health record (EHR) by traditional NLP algorithms. Methods: This was a retrospective cross-sectional study with patients identified from two tertiary headache referral centers, Mayo Clinic Arizona and Mayo Clinic Rochester. All neurology consultation notes written by 15 specialized clinicians (11 headache specialists and 4 nurse practitioners) between 2012 and 2022 were extracted and 1915 notes were used for model fine-tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model, (2) Generative Pre-Trained Transformer-2 (GPT-2) Question Answering (QA) model zero-shot, (3) GPT-2 QA model few-shot training fine-tuned on clinical notes, and (4) GPT-2 generative model few-shot training fine-tuned on clinical notes to generate the answer by considering the context of included text. Results: The mean (standard deviation) headache frequency of our training and testing datasets were 13.4 (10.9) and 14.4 (11.2), respectively. The GPT-2 generative model was the best-performing model with an accuracy of 0.92 (0.91, 0.93, 95% confidence interval [CI]) and R2 score of 0.89 (0.87, 0.90, 95% CI), and all GPT-2-based models outperformed the ClinicalBERT model in terms of exact matching accuracy. Although the ClinicalBERT regression model had the lowest accuracy of 0.27 (0.26, 0.28), it demonstrated a high R2 score of 0.88 (0.85, 0.89), suggesting the ClinicalBERT model can reasonably predict the headache frequency within a range of ≤ ± 3 days, and the R2 score was higher than the GPT-2 QA zero-shot model or GPT-2 QA model few-shot training fine-tuned model. Conclusion: We developed a robust information extraction model based on a state-of-the-art large language model, a GPT-2 generative model that can extract headache frequency from EHR free-text clinical notes with high accuracy and R2 score. It overcame several challenges related to different ways clinicians document headache frequency that were not easily achieved by traditional NLP models. We also showed that GPT-2-based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT-2 generative model and inference code with open-source license of community use in GitHub. Additional fine-tuning of the algorithm might be required when applied to different health-care systems for various clinical use cases.

Effective Natural Language Processing Algorithms for Gout Flare Early Alert from Chief Complaints

Improving the accuracy of automated gout flare ascertainment using natural language processing of electronic health records and linked Medicare claims data

Large language models for accurate disease detection in electronic health records

Development and Evaluation of Machine Learning Models for the Detection of Emergency Department Patients with Opioid Misuse from Clinical Notes

Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models

Advancing rheumatology with natural language processing: insights and prospects from a systematic review

Detecting goals of care conversations in clinical notes with active learning

Textual Triage: Assessing GPT4 for Classification of Free-Text Medication-Related Messages for Hypertension Management

Early identification of patients with acute gastrointestinal bleeding using natural language processing and decision rules

Large Language Models Improve the Identification of Emergency Department Visits for Symptomatic Kidney Stones

Classifying Characteristics of Opioid Use Disorder From Hospital Discharge Summaries Using Natural Language Processing

Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room

A Large Language Model-Based Generative Natural Language Processing Framework Finetuned on Clinical Notes Accurately Extracts Headache Frequency from Electronic Health Records

Deep learning-based natural language processing for detecting medical symptoms and histories in emergency patient triage

A large language model-based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records

Natural language processing with deep learning for medical adverse event detection from free-text medical narratives: A case study of detecting total hip replacement dislocation

Development of a deep learning model for the automated detection of green pixels indicative of gout on dual energy CT scan

GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models

Emerging Applications of NLP and Large Language Models in Gastroenterology and Hepatology: A Systematic Review

Using natural language processing in emergency medicine health service research: A systematic review and meta-analysis

Mining Clinical Notes for Physical Rehabilitation Exercise Information: Natural Language Processing Algorithm Development and Validation Study