Abstract e13624

Background: Cancer staging is instrumental in driving clinical management and trial enrollment, but staging data are generally unreliable and unstructured in the electronic health record (EHR). Advances in natural language processing (NLP) may facilitate clinical staging and documentation [1], but challenges to real-world implementation include (1) automatically identifying appropriate patients and reports from the EHR and (2) developing an unbiased dataset for training and validation [2]. We describe our institution’s novel approach to overcoming these barriers while building an in-house NLP pipeline for clinical tumor staging of non-small-cell lung cancer (NSCLC).

Methods: We identified patients by searching our EHR (Epic) for a molecular analysis test that is ordered specifically for pathological diagnoses of NSCLC at our institution. We used the test order date as the diagnosis proxy date (DPD). For each patient, we extracted imaging reports dated from 16 weeks before to 6 weeks after the DPD. To derive primary tumor size, we analyzed the CT Chest or PET/CT report closest to the DPD using an oncology-trained NLP text extraction and labeling tool (John Snow Labs). We cleaned all extracted tumor size entities and identified the largest measurement linked to the lungs. We compared primary tumor measurements from the NLP pipeline with those in a preexisting, manually compiled cancer registry (CNEXT) and analyzed discrepancies through manual chart review.

Results: A total of 542 patients with a DPD between 11/2016 and 9/2023 were processed through the NLP pipeline. Of 443 patients with valid values in both the pipeline and CNEXT, 53% (234) were exact matches and 20% (90) were close matches (within 5 mm), yielding a 73% accuracy rate for values within 5 mm. When mismatched values were manually reviewed, several cases in CNEXT were found to have a DPD differing by more than 3 months and tumor sizes derived from external reports. When these cases were excluded, 320 of the remaining 349 patients had valid values in both the pipeline and the updated manual review. In this refined population, 66% (213) were exact matches and 15% (48) were close matches, yielding an 82% accuracy rate for values within 5 mm.

Conclusions: To our knowledge, this is the first report of a pathology-based method to automatically and reliably identify patients with NSCLC and their relevant imaging reports directly from the EHR. We used a prebuilt NLP tool to derive primary tumor sizes with relatively high accuracy and found that adding flags for timeline discrepancies and external reports can further improve validity. As we near completion of analogous pipelines for node and metastasis staging, we will develop methodology to identify subgroups of patients who can be clinically staged with near-perfect accuracy, ultimately aiming to substantially limit manual staging of uncomplicated cases.

1. Puts 2023. 2. Wang 2022.
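The report-selection and comparison rules described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Report` class, the field and function names, the inclusive window boundaries, and the tie-breaking behavior are all assumptions made for illustration, and the NLP extraction step (John Snow Labs) is not reproduced; lung-linked sizes are assumed to be already extracted and cleaned, in millimeters.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Window and tolerance taken from the abstract; inclusive bounds are an assumption.
WEEKS_BEFORE = 16
WEEKS_AFTER = 6
CLOSE_MATCH_MM = 5.0
ELIGIBLE_MODALITIES = {"CT Chest", "PET/CT"}


@dataclass
class Report:
    """Hypothetical container for one imaging report."""
    report_date: date
    modality: str
    lung_sizes_mm: List[float]  # cleaned size entities linked to the lungs


def closest_report(reports: List[Report], dpd: date) -> Optional[Report]:
    """Return the eligible report nearest the diagnosis proxy date (DPD)."""
    start = dpd - timedelta(weeks=WEEKS_BEFORE)
    end = dpd + timedelta(weeks=WEEKS_AFTER)
    eligible = [
        r for r in reports
        if r.modality in ELIGIBLE_MODALITIES and start <= r.report_date <= end
    ]
    if not eligible:
        return None
    # Ties resolve to the first report in input order (an assumption).
    return min(eligible, key=lambda r: abs((r.report_date - dpd).days))


def primary_tumor_size(report: Optional[Report]) -> Optional[float]:
    """Largest lung-linked measurement in the report, or None if absent."""
    if report is None or not report.lung_sizes_mm:
        return None
    return max(report.lung_sizes_mm)


def classify(pipeline_mm: Optional[float], registry_mm: Optional[float]) -> str:
    """Label a pipeline/registry pair as exact, close, mismatch, or invalid."""
    if pipeline_mm is None or registry_mm is None:
        return "invalid"
    diff = abs(pipeline_mm - registry_mm)
    if diff == 0:
        return "exact"
    if diff <= CLOSE_MATCH_MM:
        return "close"
    return "mismatch"


if __name__ == "__main__":
    dpd = date(2020, 1, 15)
    reports = [
        Report(date(2019, 9, 1), "CT Chest", [40.0]),      # outside the window
        Report(date(2020, 1, 10), "PET/CT", [12.0, 31.0]),  # eligible
        Report(date(2020, 1, 14), "MRI Brain", []),         # wrong modality
    ]
    size = primary_tumor_size(closest_report(reports, dpd))
    print(size, classify(size, 35.0))
```

Under these definitions, the reported accuracy is the share of valid pairs labeled exact or close: (234 + 90) / 443 for the full cohort and (213 + 48) / 320 for the refined one.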

Evaluating the accuracy of Lung-RADS score extraction from radiology reports: Manual entry versus natural language processing

Automatic extraction of imaging observation and assessment categories from breast magnetic resonance imaging reports with natural language processing.

Artificial intelligence: Natural language processing for peer-review in radiology

Natural Language Processing to Identify Abnormal Breast, Lung, and Cervical Cancer Screening Test Results from Unstructured Reports to Support Timely Follow-up.

Automatic Extraction of Lung Cancer Staging Information from Computed Tomography Reports: Deep Learning Approach.

Cross-Institutional Structured Radiology Reporting for Lung Cancer Screening Using a Dynamic Template-Constrained Large Language Model

Novel approach to implementing natural language processing for clinical staging of non-small-cell lung cancer.

Development of a Structured Query Language and Natural Language Processing Algorithm to Identify Lung Nodules in a Cancer Centre

Development and Validation of a Dynamic-Template-Constrained Large Language Model for Generating Fully-Structured Radiology Reports

Natural language processing for populating lung cancer clinical research data

NIMG-63. Leveraging LLMs for accurate differentiation of radiation necrosis and tumor progression in brain MRI reports: A study on automated scoring and clinical implications

Automating Stroke Data Extraction From Free-Text Radiology Reports Using Natural Language Processing: Instrument Validation Study

The implementation of natural language processing to extract index lesions from breast magnetic resonance imaging reports

Leveraging natural language processing to identify eligible lung cancer screening patients with the electronic health record

A real-world evaluation of the diagnostic accuracy of radiologists using positive predictive values verified from deep learning and natural language processing chest algorithms deployed retrospectively

Automated derivation of diagnostic criteria for lung cancer using natural language processing on electronic health records: a pilot study

Using Recurrent Neural Networks to Extract High-Quality Information From Lung Cancer Screening Computerized Tomography Reports for Inter-Radiologist Audit and Feedback Quality Improvement

Comparison of AI software tools for automated detection, quantification and categorization of pulmonary nodules in the HANSE LCS trial

Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system

Automating the detection of treatment progression in patients with lung cancer using large language models.

Application of natural language processing to post-structuring of rectal cancer MRI reports