Abstract: Clinical data storage in unstructured notes and siloed datasets presents a major challenge for large-scale cancer informatics. Whether natural language processing (NLP) combined with multimodal integration across datasets can produce a mineable resource and improve discovery of relationships between tumor genomics and clinical phenotypes is unknown. We hypothesized that NLP could automatically annotate a pan-cancer corpus of 82,464 patients with tumor genomic sequencing. To develop algorithms to annotate free-text reports, we leveraged the AACR Project GENIE Biopharma Collaborative (BPC), a structured curation of electronic medical records (EMR) from five cancer types (non-small cell lung (NSCLC), breast, colorectal, prostate, and pancreatic cancer), to train and validate several Transformer and rule-based NLP models. After automating the generation of NLP annotations alongside medication, demographic, tumor registry, survival, and tumor genomic sequencing data, we tested whether clinicogenomic relationships not apparent in the smaller BPC cohort might be discoverable in the larger cohort. In 5-fold cross-validation, NLP Transformers accurately annotated the presence of cancer (AUC=0.99), cancer progression (AUC=0.97), and sites of disease (AUC=0.99) from radiology reports, and the presence of prior outside treatment (AUC=0.98) and hormone receptor (HR) and HER2 receptor status (AUC=0.98, 0.98) from clinician notes. In addition, rule-based models, trained on non-BPC data and validated on the whole BPC cohort, annotated smoking status from clinician notes (ACC=0.95), and Gleason score (ACC=1.0), PD-L1 status (ACC=0.98), and mismatch repair deficiency (ACC=0.98) from histopathology reports. NLP annotations were merged with genomic and other structured clinical data to create a Clinicogenomic, Harmonized Oncologic Real-world Dataset (MSK-CHORD). Finally, we tested whether associations not apparent in the BPC might be discoverable in MSK-CHORD.
We found positive associations between Gleason score and gene-level alterations in prostate cancer, including TP53, PTEN, and BRCA2 (q<0.1), none of which were adequately powered for detection in the BPC. PD-L1 status was associated with better survival following immunotherapy treatment in NSCLC, but this association was statistically significant only in the larger MSK-CHORD. In breast cancer, NF1 mutations were associated with prior therapy in both cohorts, but this association was significant only in MSK-CHORD. The infrastructure generating MSK-CHORD uses a combination of on-premise and cloud computing resources and open-source development operations applications to automate processes. Once annotations are created, data are imported into a local instance of cBioPortal, where researchers can visualize data and perform analyses. The system generating MSK-CHORD demonstrates how large-scale data delivery and integration can fuel cancer research.

Citation Format: Christopher J. Fong, Karl Pichotta, Thinh Tran, Michele Waters, Tom Fu, Mono Pirun, Mirella Altoe, Brooke Mastrogiacomo, Anisha Luthra, Mehnaj Ahmed, Arfath Pasha, Armaan Kohli, Raymond Lim, Tom Pollard, Darin Moore, Benjamin Gross, Avery Wang, Calla Chennault, Ritika Kundra, Ramyasree Madupuri, Ino de Bruijn, Aaron Lisman, Walid K. Chatila, Subhiksha Nandakumar, Anika Begum, Doori Rose, Kenneth L. Kehl, Deborah Schrag, Michael Berger, Jian Carrot-Zhang, Pedram Razavi, Bob Li, Peter Stetson, Nikolaus Schultz, Justin Jee. Systematic generation of a clinicogenomic harmonized oncologic real-world dataset (MSK-CHORD) [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3892.
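The abstract reports that rule-based models annotated Gleason score from histopathology reports but does not describe the rules themselves. As a rough illustration of what a rule of this kind can look like, the sketch below uses a regular expression to pull a Gleason primary pattern, secondary pattern, and total from free text. The pattern, the function name `extract_gleason`, and the sample phrasings are all assumptions made for illustration; they are not the authors' implementation, and real reports would likely need many more variants.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern (an assumption, not taken from the abstract): matches
# phrasings such as "Gleason score 4+3=7", "Gleason 3 + 4", or
# "Gleason score: 4+5 = 9". Real pathology reports vary far more widely.
GLEASON_PATTERN = re.compile(
    r"gleason(?:\s+(?:score|grade|pattern))?\s*:?\s*"
    r"(?P<primary>[1-5])\s*\+\s*(?P<secondary>[1-5])"
    r"(?:\s*=\s*(?P<total>\d{1,2}))?",
    re.IGNORECASE,
)


def extract_gleason(report_text: str) -> Optional[Tuple[int, int, int]]:
    """Return (primary, secondary, total) for the first match, or None."""
    match = GLEASON_PATTERN.search(report_text)
    if match is None:
        return None
    primary = int(match.group("primary"))
    secondary = int(match.group("secondary"))
    stated_total = match.group("total")
    # Reject a match whose stated total disagrees with the two patterns,
    # since that usually signals a misread rather than a real score.
    if stated_total is not None and int(stated_total) != primary + secondary:
        return None
    return primary, secondary, primary + secondary
```

For example, `extract_gleason("Adenocarcinoma, Gleason score 4+3=7.")` returns `(4, 3, 7)`, while a report with no Gleason mention returns `None`. Returning `None` instead of guessing keeps unannotated reports distinguishable from annotated ones when the output is later merged with structured data.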
