Abstract: The storage of clinical data in unstructured notes and siloed datasets presents a major challenge for large-scale cancer informatics. Whether natural language processing (NLP) combined with multimodal integration across datasets can produce a mineable resource and improve discovery of relationships between tumor genomics and clinical phenotypes is unknown. We hypothesized that NLP could automatically annotate a pan-cancer corpus of 82,464 patients with tumor genomic sequencing. To develop algorithms to annotate free-text reports, we leveraged the AACR Project GENIE Biopharma Collaborative (BPC), a structured curation of electronic medical records (EMR) from five cancer types (non-small cell lung (NSCLC), breast, colorectal, prostate, and pancreatic cancer), to train and validate several Transformer and rule-based NLP models. After automating the generation of NLP annotations alongside medication, demographic, tumor registry, survival, and tumor genomic sequencing data, we tested whether clinicogenomic relationships not apparent in the smaller BPC cohort might be discoverable in the larger cohort. In 5-fold cross-validation, NLP Transformers accurately annotated the presence of cancer (AUC=0.99), cancer progression (AUC=0.97), and sites of disease (AUC=0.99) from radiology reports, and presence of prior outside treatment (AUC=0.98) and hormone receptor (HR) and HER2 receptor status (AUC=0.98, 0.98) from clinician notes. In addition, rule-based models, trained on non-BPC data and validated on the whole BPC cohort, annotated smoking status from clinician notes (ACC=0.95), and Gleason score (ACC=1.0), PD-L1 status (ACC=0.98), and mismatch repair deficiency (ACC=0.98) from histopathology reports. NLP annotations were merged with genomic and other structured clinical data to create a Clinicogenomic, Harmonized Oncologic Real-world Dataset (MSK-CHORD). Finally, we tested whether associations not apparent in the BPC might be discoverable in MSK-CHORD.
We found positive associations between Gleason score and gene-level alterations in prostate cancer including TP53, PTEN and BRCA2 (q<0.1), none of which were adequately powered for detection in the BPC. We found PD-L1 status was associated with better survival following immunotherapy treatment in NSCLC, but only in the larger MSK-CHORD was this association statistically significant. In breast cancer, NF1 mutations were associated with prior therapy in both cohorts, but this association was only significant in MSK-CHORD. The infrastructure generating MSK-CHORD uses a combination of on-premises and cloud computing resources and open-source development operations (DevOps) applications to automate processes. Once annotations are created, data is imported into a local instance of cBioPortal, where researchers can visualize data and perform analyses. The system generating MSK-CHORD demonstrates how large-scale data delivery and integration can fuel cancer research.

Citation Format: Christopher J. Fong, Karl Pichotta, Thinh Tran, Michele Waters, Tom Fu, Mono Pirun, Mirella Altoe, Brooke Mastrogiacomo, Anisha Luthra, Mehnaj Ahmed, Arfath Pasha, Armaan Kohli, Raymond Lim, Tom Pollard, Darin Moore, Benjamin Gross, Avery Wang, Calla Chennault, Ritika Kundra, Ramyasree Madupuri, Ino de Bruijn, Aaron Lisman, Walid K. Chatila, Subhiksha Nandakumar, Anika Begum, Doori Rose, Kenneth L. Kehl, Deborah Schrag, Michael Berger, Jian Carrot-Zhang, Pedram Razavi, Bob Li, Peter Stetson, Nikolaus Schultz, Justin Jee. Systematic generation of a clinicogenomic harmonized oncologic real-world dataset (MSK-CHORD) [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3892.
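
The q<0.1 threshold reported in the abstract is a false-discovery-rate-adjusted significance cutoff. The abstract does not state which adjustment procedure was used, so the following is only a minimal sketch assuming the common Benjamini-Hochberg method. The function name `benjamini_hochberg`, the placeholder gene labels, and the p-values are all hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum of
    # p * m / rank over all ranks at or above the current one (monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four gene-level association tests (illustrative only).
p_by_gene = {"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.04, "GENE_D": 0.6}
q = benjamini_hochberg(list(p_by_gene.values()))

# Keep the genes whose adjusted value falls below the q < 0.1 cutoff.
significant = [gene for gene, q_value in zip(p_by_gene, q) if q_value < 0.1]
print(significant)
```

Under this procedure the number of tests enters the adjustment directly, so with a fixed set of hypotheses a larger cohort helps only by producing smaller raw p-values, which is consistent with the abstract's point that some associations were underpowered in the BPC.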

Unsupervised Extraction of Phenotypes from Cancer Clinical Notes for Association Studies

Enabling scalable clinical interpretation of ML-based phenotypes using real world data

Automated real-world data integration improves cancer outcome prediction

Identifying Associations between Somatic Mutations and Clinicopathologic Findings in Lung Cancer Pathology Reports

Leveraging Genetic Reports and Electronic Health Records for the Prediction of Primary Cancers: Algorithm Development and Validation Study

A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history

Leveraging a Joint of Phenotypic and Genetic Features on Cancer Patient Subgrouping

Towards Structuring Real-World Data at Scale: Deep Learning for Extracting Key Oncology Information from Clinical Text with Patient-Level Supervision

Automated feature selection of predictors in electronic medical records data

Data-Driven Information Extraction and Enrichment of Molecular Profiling Data for Cancer Cell Lines

Synergizing Data Imputation and Electronic Health Records for Advancing Prostate Cancer Research: Challenges, and Practical Applications

An innovative solution for breast cancer textual big data analysis

Extraction, Labeling, Clustering, and Semantic Mapping of Segments From Clinical Notes

Approach to machine learning for extraction of real-world data variables from electronic health records

Abstract 3892: Systematic generation of a clinicogenomic harmonized oncologic real-world dataset (MSK-CHORD)

Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources.

Unsupervised extraction, labelling and clustering of segments from clinical notes

Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

Abstract 4090: Creating research quality cancer genomic data from electronic health records

Investigating Alternative Feature Extraction Pipelines For Clinical Note Phenotyping

Development and Validation of a Natural Language Processing Algorithm for Extracting Clinical and Pathological Features of Breast Cancer From Pathology Reports