Abstract:

Background: Natural language processing (NLP) tools including recently developed large language models (LLMs) have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large‐scale projects that rely on variables that are not typically recorded in a structured form, such as patient signs and symptoms.

Objectives: This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large‐scale research, public health surveillance, and EHR‐based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes.

Methods: The study population consisted of patients aged ≥ 18 who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task‐specific LLMs to perform the task of named entity recognition: a convolutional neural network‐based model (SpaCy) and a transformer‐based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level.

Results: A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. Overall F1 measure for note‐level symptom identification weighted by entity frequency was 0.84 for the SpaCy model and 0.88 for the Longformer model. F1 measure for identifying presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the SpaCy model and 0.98 (240/250 correctly classified) for the Longformer model.

Conclusions: The study demonstrated the utility of LLMs and transformer‐based models in particular for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom on the note level, with variable performance for individual symptoms.
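The note-level evaluation described in the abstract (per-symptom precision, recall, and F1, an F1 weighted by frequency, and a collapsed "any symptom" score) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the toy notes, and the symptom labels are hypothetical, and the sketch assumes that "weighted by entity frequency" means weighting each symptom's F1 by the number of gold-standard notes in which that symptom is present; the abstract does not spell out the exact weighting.

```python
def note_level_prf(gold, pred, labels):
    """Per-label precision, recall and F1 at the note level.

    gold, pred: lists of sets, one set of symptom labels per note.
    A label counts as present in a note if it appears in that note's set.
    """
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # support = number of gold-positive notes (assumed weighting basis)
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return scores


def weighted_f1(scores):
    """Average the per-label F1 values, weighted by each label's support."""
    total = sum(s["support"] for s in scores.values())
    if total == 0:
        return 0.0
    return sum(s["f1"] * s["support"] for s in scores.values()) / total


def collapse_to_any(notes):
    """Reduce each note to a single 'any' label if it has at least one symptom."""
    return [{"any"} if labels else set() for labels in notes]


# Hypothetical toy data: four notes, two symptom labels.
gold = [{"dysuria", "fever"}, {"dysuria"}, set(), {"fever"}]
pred = [{"dysuria"}, {"dysuria", "fever"}, set(), {"fever"}]
labels = ["dysuria", "fever"]

scores = note_level_prf(gold, pred, labels)
overall = weighted_f1(scores)
any_scores = note_level_prf(collapse_to_any(gold), collapse_to_any(pred), ["any"])

print(scores)
print("weighted F1:", overall)
print("any-symptom F1:", any_scores["any"]["f1"])
```

In this toy example the predicted fever label is wrong in two notes, yet the "any symptom" score is unaffected because every note that contains a symptom still has at least one prediction. The same effect is consistent with the abstract's finding that note-level detection of any symptom scored higher than symptom-specific identification.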

Large Language Model Symptom Identification from Clinical Text: A Multi-Center Study

Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses

Evaluating large language models on medical, lay language, and self-reported descriptions of genetic conditions

Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models

Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools

Large Language Models for Disease Diagnosis: A Scoping Review

On the limitations of large language models in clinical diagnosis

A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Large language models for accurate disease detection in electronic health records

Large Language Models in Healthcare: A Comprehensive Benchmark

Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy

Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Towards Accurate Differential Diagnosis with Large Language Models

Evaluation of large language models as a diagnostic aid for complex medical cases

Large Language Models in Medical Term Classification and Unexpected Misalignment Between Response and Reasoning

Large language models encode clinical knowledge

Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

Can Large Language Models abstract Medical Coded Language?

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Exploring Large Language Models for Acronym, Symbol Sense Disambiguation, and Semantic Similarity and Relatedness Assessment

Diagnostic Accuracy of a Custom Large Language Model on Rare Pediatric Disease Case Reports