Abstract:
Introductory Statement: The goal is to use machine learning (ML) and large language models (LLMs) to augment the manual curation of cancer data elements.
Introduction: Memorial Sloan Kettering Cancer Center (MSKCC) has ~100,000 cancer patients with genomic testing, and the number is growing. Clinicians use genomic data for research but lack clinical data to analyze alongside it. We use a vendor, VASTA Global, to hire curators who manually curate cancer patients' core clinical data elements (CCDE) from unstructured paragraph text in electronic medical record (EMR) notes. CCDE encompasses 122 data elements that include a patient's full cancer history, which can take up to 1 working day to curate. We collaborated with the vendor Realyze Intelligence Healthcare Solutions to use their AI pipeline to generate the manually curated dataset. Realyze generated CCDE data elements such as histology, pathology site, MMR, TNM staging, ECOG, and KPS for a pilot lung cancer cohort of 150 patients. We manually validated the generated data for 74 of the 150 patients.
Methods: The Realyze platform uses a combination of LLMs, ML algorithms, and standard terminologies to create a cancer patient model. These models are flexible enough to address the unique needs and challenges of a pan-cancer oncology model. Using a standardized FHIR export, results were delivered to a data lake solution and written into a REDCap database to enable human review.
Summary: We manually assessed 74 patients. The NLP gave concordant values for MMR, KPS, and TNM staging in 100% of instances. For MMR, these were all null values with false negative (FN) of 100% accuracy. Pathology site had 92.15% accuracy and histology had 97.5% accuracy.
Conclusion: We will work on refining the ICDO3 lists for pathology site and histology to increase accuracy. Once Realyze refines their model for these data elements, we will re-run it on a larger cohort of cancer patients and calculate the accuracy.
Accuracy Results

Clinical data elements (74 patients assessed)    Accuracy %
ECOG                                             98.6
KPS                                              100
T (path)                                         100
T (clinical)                                     100
N (path)                                         100
N (clinical)                                     100
M (path)                                         100
M (clinical)                                     100
MMR                                              100
Histology (path)                                 97.5
Path site                                        92.15

Citation Format: Andrew Niederhausern, Nadia S. Bahadur, Gary Wallace, Gilan E. Saadawi, John Philip. Machine learning and large language model approach to pancancer data elements [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular s); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl) nr 4966.
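
The accuracy percentages reported in this abstract can be read as per-element agreement rates between the pipeline output and the manually validated values. The abstract does not describe how the figures were computed, so the following is only a minimal sketch under stated assumptions: the function name `concordance_by_element`, the `(patient_id, element)` keying, and the toy values are all hypothetical, and the denominator is assumed to be the number of manually reviewed instances for each element.

```python
from collections import defaultdict


def concordance_by_element(extracted, curated):
    """Return {element: percent agreement} over the manually reviewed instances.

    Both arguments map (patient_id, element) -> value. None stands for a null
    value. An instance missing from `extracted` is treated as null (assumption).
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for (patient_id, element), reference in curated.items():
        total[element] += 1
        # Exact match only; two nulls count as concordant.
        if extracted.get((patient_id, element)) == reference:
            agree[element] += 1
    return {
        element: round(100.0 * agree[element] / total[element], 2)
        for element in total
    }


# Illustrative toy values only; these are not data from the study.
curated = {
    ("p1", "KPS"): 90, ("p2", "KPS"): 80,
    ("p1", "MMR"): None, ("p2", "MMR"): None,
    ("p1", "path_site"): "C34.1", ("p2", "path_site"): "C34.3",
}
extracted = {
    ("p1", "KPS"): 90, ("p2", "KPS"): 80,
    ("p1", "MMR"): None, ("p2", "MMR"): None,
    ("p1", "path_site"): "C34.1", ("p2", "path_site"): "C34.9",
}

result = concordance_by_element(extracted, curated)
print(result)
```

Because two null values count as a match in this sketch, an element whose reviewed instances are all null (as reported for MMR) scores 100% without any positive case having been checked. A larger cohort would presumably need non-null instances to make that figure informative.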

Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing

Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports

Large language models for extracting histopathologic diagnoses from electronic health records

Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4)

Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer

Privacy-preserving large language models for structured medical information retrieval

CORAL: Expert-Curated medical Oncology Reports to Advance Language Model Inference

Development of a privacy preserving large language model for automated data extraction from thyroid cancer pathology reports

From Text to Tables: A Local Privacy Preserving Large Language Model for Structured Information Retrieval from Medical Documents

Large Multimodal Model based Standardisation of Pathology Reports with Confidence and their Prognostic Significance

A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports

Transformative potential of Large Language Models in data mining on Electronic Health Records

Assessing Large Language Models for Oncology Data Inference from Radiology Reports

A survey analysis of the adoption of large language models among pathologists

Use of Natural Language Processing to Infer Sites of Metastatic Disease From Radiology Reports at Scale

Abstract 4966: Machine learning and large language model approach to pancancer data elements

Enhancing Clinical Data Extraction from Pathology Reports: A Comparative Analysis of Large Language Models

Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports

Synoptic Reporting by Summarizing Cancer Pathology Reports using Large Language Models

Large language models for structured reporting in radiology: past, present, and future