Abstract: The NCI's The Cancer Genome Atlas (TCGA) project profiled over 10,000 tumor samples over the course of 10 years. As tissue-specific working groups reviewed the available data, these patient samples were separated into distinct molecular subtypes, and the resulting clusters were reported in various marker papers. While these assignments provided invaluable information about common patterns of molecular characteristics in different types of cancer, there was no consistent methodology for assigning new samples to the defined molecular subtypes. The NCI's Tumor Molecular Pathology group was formed to create machine learning-based models that could be applied to non-TCGA samples to determine their TCGA-mapped subtypes. Five modeling systems (JADBio; SKGrid, by the Oregon Health and Science University; CloudForest, by the Institute for Systems Biology; AKLIMATE, by University of California Santa Cruz; and subSCOPE, by BC Cancer's Genome Sciences Centre) were trained to recognize TCGA subtypes using multi-omic measurements from gene expression, DNA methylation, miRNA expression, copy number, and somatic mutation calls. Although the TCGA samples were profiled using multi-omic technologies, single-platform and/or compact feature set models were also assessed for their ability to assign these classifications. Each machine learning system created predictive models for 106 subtypes from 26 cancer types using as few features as possible, with a maximum of 100 features allowed for scored models. A set of 411,706 models was developed, composed of the results of each learning method across the various omic platforms. Top models, both multi-omic and single-platform, were selected for each cancer type. On average, models achieved an overall weighted F1 score of 0.895 with 42 features.
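For readers unfamiliar with the metric, the following is a minimal sketch of how a support-weighted F1 score is commonly computed: per-class F1 values are averaged with weights proportional to the number of true samples in each class. The abstract does not state the exact averaging procedure used, so this definition is an assumption, and the function name `weighted_f1` and the subtype labels in the usage note are purely illustrative.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Assumption: this mirrors the common "weighted" averaging convention;
    the abstract does not specify the exact definition used.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1  # weight each class by its support
    return score
```

As a usage illustration with hypothetical labels, `weighted_f1(["LumA", "LumA", "LumA", "Basal"], ["LumA", "LumA", "Basal", "Basal"])` gives per-class F1 values of 0.8 and 2/3, which the 3:1 class supports combine into roughly 0.767.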
The top models for each cancer type had a mean overall weighted F1 score of 0.936 with a mean of 29 features; in 20 of the 26 cancer types, models using only gene expression provided the best performance. Analysis of the features selected by the models showed that some known onco-drivers were selected by many models, but different models often used features from different genes while reaching similar levels of performance. Network-level analysis revealed that many of the genes behind these selected features operate within the same pathways. Transferability of the models to external datasets was tested by applying models trained on TCGA breast cancer data to the AURORA and METABRIC datasets. Interestingly, despite the data platform difference between TCGA (RNAseq) and METABRIC (microarray), model performance saw only minimal degradation of F1 values in transfer. This set of models and the training dataset will provide new opportunities for researchers and translational scientists to connect new tumors to the subtypes seen in the TCGA cohorts.

Citation Format: Kyle Ellrott, Chris K. Wong, Christina Yau, Mauro A. Castro, Jordan Lee, Brian Karlberg, Jasleen K. Grewal, Vincenzo Lagani, Bahar Tercan, Verena Friedl, Toshinori Hinoue, Vladislav Uzunangelov, Lindsay Westlake, Xavier Loinaz, Ina Felau, Peggy Wang, Anab Kemal, Samantha J. Caesar-Johnson, Ilya Shmulevich, Alexander J. Lazar, Ioannis Tsamardinos, Katherine A. Hoadley, The Cancer Genome Atlas Analysis Network, Gordon A. Robertson, Theo A. Knijnenburg, Christopher C. Benz, Joshua M. Stuart, Jean C. Zenklusen, Andrew D. Cherniack, Peter W. Laird. Leveraging compact feature sets for TCGA-based molecular subtype classification on new samples [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 6548.

Abstract 7352: ScanCT: A tree-based machine learning model to detect single-cell genomic features associated with clinical outcomes

Abstract 878: Enhancing single-cell RNA sequencing analysis in cancer research: A machine learning framework based on LightGBM for automated cell type annotation

Abstract 5380: An Integrated Platform for the Clinical Detection and Molecular Profiling of Single Circulating Tumor Cells

Abstract A029: Evaluating and interpreting scGPT: A foundation model for single-cell biology in real-world cancer clinical trial data

Abstract 1700: Serial monitoring of single-cell circulating tumor cell genomics in metastatic lobular breast cancer to identify precision and immuno-oncology biomarkers with therapeutic implications

Abstract 2334: Computational methods for optimizing marker selection, clonal lineage reconstruction, and longitudinal tracking of clonal dynamics via circulating tumor DNA (ctDNA)

Abstract 2390: Longitudinal assessment of circulating tumor DNA in patients with advanced colorectal cancer: A proposed general statistical framework and visualization tool

Abstract 6202: Integrating public single-cell transcriptomics and patient profiles to guide clinical development

Abstract 6291: Integrative single-cell tracking of genome evolution and tumor cell plasticity in small cell lung cancer (SCLC)

Abstract 6931: Characterization of cancer evolution landscape based on accurate detection of somatic mutations in single tumor cells

Abstract 6925: Integrative single-cell tracking of genome evolution and tumor cell plasticity in small cell lung cancer (SCLC)

Abstract 5721: Automated annotation for large-scale clinicogenomic models of lung cancer treatment response and overall survival

Abstract 3694: Genomic profiling of single cancer cells using the novel single-cell integrated mutational profiling of actionable cancer targets (scIMPACT) assay

Abstract 7497: Advancing multi-biomarker CTC assay and CDx development through the automated GenoCTC system

Abstract 6548: Leveraging compact feature sets for TCGA-based molecular subtype classification on new samples

Detecting phenotype-specific tumor microenvironment by merging bulk and single cell expression data to spatial transcriptomics

Abstract 5552: Uncovering clinically significant tumor microenvironment interaction programs across diverse cancers

Abstract 2059: Machine learning integration of transcriptome-wide spatial sequencing data and ultra-high plex spatial proteomic data enables the prioritization of cancer drug targets

Abstract 1906: Single cell precision oncology: A proof of principle

Abstract 5440: Deep-learning model for tumor type classification enables enhanced clinical decision support in cancer diagnosis

Abstract 2326: Integrating single nucleotide variants (SNVs), copy number alterations (CNAs), and structural variants (SVs) into single-cell clonal lineage inference